Decoding Student Dialogue: A Multi-Dimensional Comparison and Bias Analysis of Large Language Models as Annotation Tools

Jie Cao; Jifan Yu; Zhanxin Hao

arxiv: 2604.04370 · v1 · submitted 2026-04-06 · 💻 cs.HC

Pith reviewed 2026-05-10 20:02 UTC · model grok-4.3

classification 💻 cs.HC

keywords educational dialogue annotation · large language models · prompting strategies · bias analysis · student learning processes · multi-agent prompting · cognitive dimensions · annotation accuracy

The pith

Large language models annotate student dialogues with highest accuracy using multi-agent prompting, though differences lack statistical significance and biases vary by dimension and context.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests GPT-5.2 and Gemini-3 on annotating educational dialogues across affective, cognitive, meta-cognitive, and behavioral dimensions using few-shot, single-agent, and multi-agent prompting. Multi-agent prompting produced the top accuracy scores, yet these did not differ significantly from other methods. Accuracy was markedly higher on K-12 data than university data and showed subject-specific patterns, performing best on affective content and worst on cognitive content. The models displayed four recurring bias patterns, including consistent optimism in affective ratings for Gemini-3, underestimation in mathematics cognitive codes paired with overestimation in psychology, overestimation in meta-cognitive categories, and frequent confusion among behavioral labels such as questions and negotiations.

Core claim

While multi-agent prompting achieved the highest accuracy in coding student dialogues, the gains did not reach statistical significance. Accuracy proved highly context-dependent, performing significantly better on K-12 than university datasets and varying by discipline within the same level. Performance was strongest in the affective dimension and weakest in the cognitive dimension. Four bias patterns emerged: Gemini-3 showed consistent optimistic bias in affective annotations; cognitive dimension bias was domain-specific with underestimation in mathematics and overestimation in psychology; both models tended to overestimate in the meta-cognitive dimension; and behavioral categories such as question, negotiation, and statements were frequently misclassified.

What carries the argument

The three prompting strategies (few-shot, single-agent, multi-agent reflection) applied to four coding dimensions for comparing GPT-5.2 and Gemini-3 annotation performance against human ground truth.

If this is right

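Multi-agent reflection prompting gives the highest descriptive accuracy, but because the gain is not statistically significant, the cheaper few-shot approach remains a reasonable default for educational dialogue annotation.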
Automated annotations will be more reliable in K-12 settings than in university-level data.
Targeted bias correction is required for affective and cognitive dimensions before wide use.
Behavioral categories need additional prompting or post-processing to reduce misclassification.
Context-sensitive deployment is necessary rather than uniform application across subjects and levels.
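
The point about behavioral misclassification can be made concrete with a simple confusion tally. The sketch below is illustrative only: the function name `confusion_pairs` and the toy labels are assumptions, not the paper's data or code. It counts which human label was replaced by which model label, which is one way a practitioner could see whether questions, negotiations, and statements are being swapped for one another.

```python
from collections import Counter

def confusion_pairs(human_labels, model_labels):
    """Count (human label, model label) pairs where the two disagree."""
    if len(human_labels) != len(model_labels):
        raise ValueError("label lists must have equal length")
    return Counter(
        (h, m) for h, m in zip(human_labels, model_labels) if h != m
    )

# Toy data, invented for illustration only.
human = ["question", "negotiation", "statement", "question"]
model = ["statement", "question", "statement", "negotiation"]

confusions = confusion_pairs(human, model)
for (h, m), count in confusions.most_common():
    print(f"human={h!r} coded as {m!r} by the model: {count}")
```

Pairs that dominate the tally would be natural targets for extra prompt guidance or post-processing rules.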

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These findings imply that fully automated annotation pipelines for student dialogue still require human review to catch directional biases.
The context-dependence suggests the same models may show different reliability when applied to non-educational dialogue corpora.
Future tests could measure whether fine-tuning on domain-specific data reduces the observed under- and over-estimation patterns.
Integrating LLM outputs with lightweight human verification steps could catch the residual errors that more elaborate prompting did not significantly reduce.

Load-bearing premise

The human-coded annotations used as ground truth are accurate and unbiased across dimensions, subjects, and educational levels.

What would settle it

A replication on an independent dataset with multiple human coders reporting high inter-rater reliability that finds either statistically significant accuracy differences or absent bias patterns would falsify the reported results.

Figures

Figures reproduced from arXiv: 2604.04370 by Jie Cao, Jifan Yu, Zhanxin Hao.

**Figure 1.** The feature of the three adopted prompt engineering methods.

Excerpt from Section 3.4, Data analysis: To address RQ1, we first calculated the overall annotation Accuracy. Due to perfect collinearity between educational level and subject, we employed a two-step Generalized Linear Mixed Model (GLMM) strategy to analyze utterance-level annotation correctness (binary: correct/incorrect). All models included random intercepts for Row…
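
For readers unfamiliar with this kind of model, a binary-outcome GLMM with a random intercept can be written roughly as follows. This is a generic sketch, not the authors' specification: the logit link, the symbol names, and the contents of the fixed-effect vector are assumptions, and the grouping unit for the random intercept is left abstract because the excerpt is truncated at that point.

```latex
% y_{ij} = 1 if the LLM annotation of utterance i in group j matches the human code, else 0.
% x_{ij} collects fixed effects (for example prompting method, model, coding dimension).
% u_j is a random intercept for grouping unit j.
\operatorname{logit}\,\Pr(y_{ij} = 1)
  = \beta_0 + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_j,
\qquad
u_j \sim \mathcal{N}(0, \sigma_u^{2})
```

Because educational level and subject are perfectly collinear, they cannot both enter the same fixed-effect vector, which is presumably why the excerpt describes fitting the model in two steps.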

**Figure 2.** Annotation accuracy by subjects.

Excerpt from Section 4, Results, 4.1 Contextual and Dimensional Variations in Accuracy, but Not Prompting Methods: Descriptively, annotation accuracy slightly improved as prompt complexity increased: from the Few-Shot (Gemini-3: 79.2%; GPT-5.2: 82.4%) to Single-Agent (79.5% and 83.4%) and Multi-Agent (79.7% and 83.9%). However, GLMM results (…
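
To show how small the descriptive gains are, the short sketch below computes the percentage-point difference of each prompting method over the few-shot baseline, using only the six accuracy figures quoted in the excerpt. The dictionary layout and the function name `gain_over_baseline` are illustrative choices, not part of the paper.

```python
# Overall accuracies (%) as quoted in the excerpt above.
accuracy = {
    "Gemini-3": {"few_shot": 79.2, "single_agent": 79.5, "multi_agent": 79.7},
    "GPT-5.2": {"few_shot": 82.4, "single_agent": 83.4, "multi_agent": 83.9},
}

def gain_over_baseline(scores, baseline="few_shot"):
    """Return percentage-point gains of each method over the baseline."""
    base = scores[baseline]
    return {
        method: round(value - base, 1)
        for method, value in scores.items()
        if method != baseline
    }

gains = {model: gain_over_baseline(scores) for model, scores in accuracy.items()}
for model, model_gains in gains.items():
    print(model, model_gains)
```

The largest gain is 1.5 percentage points (GPT-5.2, multi-agent versus few-shot), which is consistent with the excerpt's description of the improvement as slight.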

**Figure 3.** The annotation bias (affective, cognitive, meta-cognitive) of LLMs.

Excerpt from Section 5, Discussion and Conclusion: This study evaluates LLMs (Gemini-3 and GPT-5.2) as automated annotators for student-AI educational dialogues. Our results indicate that increasing prompt complexity yielded no statistically significant improvements over the baseline. Consequently, the few-shot approach remains a cost-effective alternative in r…

Original abstract

Educational dialogue is critical for decoding student learning processes, yet manual annotation remains time-consuming. This study evaluates the efficacy of GPT-5.2 and Gemini-3 using three prompting strategies (few-shot, single-agent, and multi-agent reflection) across diverse subjects, educational levels, and four coding dimensions. Results indicate that while multi-agent prompting achieved the highest accuracy, the results did not reach statistical significance. Accuracy proved highly context-dependent, with significantly higher performance in K-12 datasets compared to university-level data, alongside disciplinary variations within the same educational level. Performance peaked in the affective dimension but remained lowest in the cognitive dimension. Furthermore, analysis revealed four bias patterns: (1) Gemini-3 exhibited a consistent optimistic bias in the affective dimension across all subjects; (2) the cognitive dimension displayed domain-specific directional bias, characterized by systematic underestimation in Mathematics versus overestimation in Psychology; (3) both models are more prone to overestimation than underestimation within the meta-cognitive dimension; and (4) behavioral categories such as question, negotiation, and statements were frequently misclassified. These results underscore the need for context-sensitive deployment and targeted mitigation of directional biases in automated annotation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper maps some clear context-dependent accuracy gaps and four directional bias patterns in LLM dialogue annotation, but unreported human inter-rater reliability leaves the bias attributions on shaky ground.

The main things to know are that multi-agent prompting edges out the others for accuracy on student dialogue coding but the difference is not statistically significant, performance is markedly better on K-12 data than university data, and the models show four specific bias patterns tied to dimensions and subjects. Those patterns are the freshest empirical bits: Gemini's optimistic tilt in affective coding, the math-under versus psychology-over split in cognitive coding, the general overestimation in meta-cognitive, and frequent misfires on behavioral categories like questions and negotiation. The work does a straightforward job running the three prompting conditions across two models, multiple subjects, and two educational levels, and it reports the non-significant top result without overclaiming. That honesty and the multi-dimensional breakdown are the useful parts for anyone who actually annotates dialogues.

The soft spot is the ground truth. All the bias claims and accuracy comparisons rest on human-coded labels treated as fixed, yet the abstract gives no inter-annotator agreement numbers, no disagreement protocol, and no sense of how stable the human codes are across dimensions or subjects. Without that, the directional biases could partly reflect human noise or systematic human leanings rather than model properties alone. The datasets are also narrow, so the K-12 versus university gap and subject differences may not travel far.

This is aimed at learning-analytics people who want to speed up dialogue coding for feedback tools or student modeling. A reader already working on LLM annotation pipelines would pick up concrete warnings about where the models drift. It is worth sending to a serious referee because the experimental frame is simple and the practical question matters, but any review should press hard for the missing reliability metrics and clearer statistical reporting on the comparisons.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates GPT-5.2 and Gemini-3 as tools for annotating educational student dialogues using few-shot, single-agent, and multi-agent reflection prompting strategies. It compares model accuracy against human annotations across subjects, educational levels (K-12 vs. university), and four coding dimensions (affective, cognitive, meta-cognitive, behavioral), reporting that multi-agent prompting yields the highest accuracy (though not statistically significant), with performance varying by context and dimension, and identifying four specific directional bias patterns in the LLM outputs.

Significance. If the results hold after addressing ground-truth validation, the work is significant for HCI and educational technology as it offers a multi-dimensional empirical comparison of LLMs in a practical annotation task. It highlights context dependencies and directional biases that could inform more reliable automated tools for analyzing learning dialogues, while underscoring risks of over-reliance on such systems without mitigation strategies.

major comments (2)

Methods section (human annotation protocol): The four reported bias patterns and accuracy comparisons (e.g., Gemini-3 optimistic bias in affective dimension, domain-specific under/overestimation in cognitive, overestimation in meta-cognitive, misclassifications in behavioral categories) are computed against human-coded labels treated as ground truth. No inter-annotator agreement metrics (Cohen's kappa, Fleiss' kappa, or Krippendorff's alpha), number of coders, or disagreement-resolution protocol are referenced. Without these, the directional bias attributions cannot be reliably distinguished from human label noise or systematic human bias, directly undermining the central claims about model-specific biases.
Results section (statistical comparisons): The abstract states 'significantly higher performance in K-12 datasets compared to university-level data' and notes non-significance for multi-agent prompting, but without reported sample sizes per subgroup, exact statistical tests, p-values, or effect sizes, it is unclear whether the context-dependency claims are robust or driven by small/unbalanced samples.
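
As a concrete illustration of the first major comment, the sketch below computes Cohen's kappa for two coders from scratch. It is a minimal example under stated assumptions: the function name `cohens_kappa` and the two toy label lists are invented here, and nothing in the manuscript indicates that a second coder's labels exist.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two coders."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    # Observed proportion of items on which the coders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each coder's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two human coders on six utterances.
coder_1 = ["question", "statement", "negotiation", "question", "statement", "statement"]
coder_2 = ["question", "statement", "question", "question", "negotiation", "statement"]

kappa = cohens_kappa(coder_1, coder_2)
print(f"Cohen's kappa = {kappa:.3f}")
```

Reporting a statistic of this kind per coding dimension would let readers judge how much of an apparent model bias could instead be disagreement among human coders.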

minor comments (2)

Abstract: Model names 'GPT-5.2' and 'Gemini-3' should be verified for accuracy and consistency with standard nomenclature (e.g., specific release versions) throughout the manuscript.
Discussion: The implications for 'context-sensitive deployment' could be strengthened with concrete examples of how practitioners might detect or correct the identified bias patterns in new datasets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important areas for strengthening the methodological transparency and statistical reporting in our manuscript. We address each major comment below.

Referee: Methods section (human annotation protocol): The four reported bias patterns and accuracy comparisons (e.g., Gemini-3 optimistic bias in affective dimension, domain-specific under/overestimation in cognitive, overestimation in meta-cognitive, misclassifications in behavioral categories) are computed against human-coded labels treated as ground truth. No inter-annotator agreement metrics (Cohen's kappa, Fleiss' kappa, or Krippendorff's alpha), number of coders, or disagreement-resolution protocol are referenced. Without these, the directional bias attributions cannot be reliably distinguished from human label noise or systematic human bias, directly undermining the central claims about model-specific biases.

Authors: We agree this is a valid concern and a limitation in the current draft. The human annotations were performed by a single expert coder with domain expertise in educational dialogue analysis, using a pre-established coding scheme derived from prior literature; no multiple independent coders or formal disagreement-resolution protocol were employed due to the resource-intensive nature of the task. We will revise the Methods section to explicitly report the number of coders (one), provide additional details on the annotation protocol and coding manual, and add a dedicated limitations paragraph discussing the absence of inter-annotator agreement metrics. We will also qualify the bias findings as relative to the single-coder ground truth and note that multi-annotator validation would be valuable in follow-up work. This revision will not alter the comparative LLM results but will improve interpretability. revision: yes
Referee: Results section (statistical comparisons): The abstract states 'significantly higher performance in K-12 datasets compared to university-level data' and notes non-significance for multi-agent prompting, but without reported sample sizes per subgroup, exact statistical tests, p-values, or effect sizes, it is unclear whether the context-dependency claims are robust or driven by small/unbalanced samples.

Authors: We acknowledge that the current version lacks sufficient statistical detail to fully support the reported claims. We will expand the Results section (and update the abstract if needed for consistency) to include: (1) exact sample sizes per subgroup (K-12 vs. university, and breakdowns by subject and dimension), (2) the specific statistical tests used (e.g., chi-squared tests for accuracy comparisons across conditions), (3) all p-values, and (4) effect sizes (such as Cramér's V for categorical comparisons). These additions will allow readers to evaluate the robustness of the context-dependency and non-significance findings directly. revision: yes
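
The reply names chi-squared tests and Cramér's V only as examples, so the following is a sketch of what such reporting could look like rather than the authors' analysis. The counts in `table` are hypothetical, the helper names are invented, no continuity correction is applied, and the p-value shortcut holds only for a 2×2 table (one degree of freedom).

```python
import math

def chi_squared(table):
    """Pearson chi-squared statistic and sample size for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat, n

def cramers_v(table):
    """Cramér's V effect size derived from the chi-squared statistic."""
    stat, n = chi_squared(table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(stat / (n * k))

# Hypothetical counts: rows are K-12 and university,
# columns are correct and incorrect annotations.
table = [[90, 10], [70, 30]]

stat, n = chi_squared(table)
v = cramers_v(table)
# For one degree of freedom the chi-squared tail probability is erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(stat / 2))
print(f"chi2 = {stat:.2f}, n = {n}, Cramér's V = {v:.2f}, p = {p_value:.4f}")
```

A chi-squared test treats utterances as independent observations, which is a stronger assumption than a mixed model with random intercepts makes, so the two approaches can disagree on clustered dialogue data.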

Circularity Check

0 steps flagged

No circularity: empirical comparison against external human annotations

This paper conducts a direct empirical evaluation of two LLMs (GPT-5.2 and Gemini-3) against human-coded dialogue annotations using three prompting strategies. Accuracy and bias patterns are computed as straightforward statistical comparisons (e.g., accuracy rates, directional over/underestimation counts) to the provided human labels treated as ground truth. No equations, derivations, fitted parameters, or self-referential definitions appear in the abstract or described methods. There are no self-citations invoked as load-bearing uniqueness theorems, no ansatzes smuggled via prior work, and no renaming of known results as novel derivations. The central claims rest on external data (human annotations) rather than reducing to the paper's own inputs by construction. The absence of inter-annotator agreement metrics is a validity concern for the bias attributions but does not constitute circularity under the defined patterns.
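
The "directional over/underestimation counts" mentioned above can be sketched in a few lines. This example assumes the codes are ordinal integers where a higher value means a higher rating; the paper's actual coding scales are not given here, and the function name `directional_bias` and the toy values are illustrative.

```python
def directional_bias(human_codes, model_codes):
    """Count over- and underestimation of ordinal codes relative to human labels."""
    if len(human_codes) != len(model_codes) or not human_codes:
        raise ValueError("inputs must be non-empty and of equal length")
    diffs = [m - h for h, m in zip(human_codes, model_codes)]
    return {
        "over": sum(d > 0 for d in diffs),    # model coded higher than human
        "under": sum(d < 0 for d in diffs),   # model coded lower than human
        "exact": sum(d == 0 for d in diffs),  # model matched human
        "mean_signed_error": sum(diffs) / len(diffs),
    }

# Toy ordinal codes, invented for illustration only.
human = [1, 2, 2, 3, 1, 2]
model = [2, 2, 3, 3, 1, 1]

bias = directional_bias(human, model)
print(bias)
```

A positive mean signed error indicates net overestimation relative to the human codes, and a negative one indicates net underestimation. Either reading is only as trustworthy as the human labels it is measured against.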

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Empirical evaluation study with no mathematical model or derivation; relies on standard assumptions about data quality and statistical interpretation.

axioms (1)

domain assumption: Human annotations provide an unbiased and accurate ground truth for measuring LLM performance.
The entire accuracy and bias analysis depends on treating human labels as the reference standard without reported checks for human annotator variability or error.

pith-pipeline@v0.9.0 · 5512 in / 1449 out tokens · 76459 ms · 2026-05-10T20:02:06.239353+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

[1] Annan, A., Eiden, A.L., Wang, D., Du, J., Rastegar-Mojarad, M., Nomula, V.K., Wang, X.: Evaluating large language models for sentiment analysis and hesitancy analysis on vaccine posts from social media: Qualitative study. JMIR Form. Res. 9, e64723 (2025)
[2] Bloom, B.S., Engelhart, M.D., Furst, E.J., Hill, W.H., Krathwohl, D.R., et al.: Taxonomy of educational objectives: The classification of educational goals. Handbook 1: Cognitive domain. Longman New York (1956)
[3] Cao, J., Zhao, C.Q., Chen, X., Wang, S., Schunn, C., Koedinger, K.R., Lin, J.: From first draft to final insight: a multi-agent approach for feedback generation. In: AIED’25. pp. 163–176. Springer (2025)
[4] Dang, B., Huynh, L., Gul, F., Rosé, C., Järvelä, S., Nguyen, A.: Human–ai collaborative learning in mixed reality: Examining the cognitive and socio-emotional interactions. Br. J. Educ. Technol. (2025)
[5] Echterhoff, J.M., Liu, Y., Alessa, A., McAuley, J., He, Z.: Cognitive bias in decision-making with LLMs. In: Al-Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) EMNLP’2024. pp. 12640–12653. ACL, Miami, Florida, USA (Nov 2024)
[6] Edwards, J., Nguyen, A., Lämsä, J., Sobocinski, M., Whitehead, R., Dang, B., Roberts, A.S., Järvelä, S.: Human-ai collaboration: Designing artificial agents to facilitate socially shared regulation among learners. Br. J. Educ. Technol. 56(2), 712–733 (2025)
[7] Han, J., Yoo, H., Myung, J., Kim, M., Lee, T.Y., Ahn, S.Y., Oh, A.: Recipe4u: Student-chatgpt interaction dataset in efl writing education. arXiv preprint arXiv:2403.08272 (2024)
[8] Hao, Z., Cao, J., Li, R., Yu, J., Liu, Z., Zhang, Y.: Mapping student-AI interaction dynamics in multi-agent learning environments: Supporting personalized learning and reducing performance gaps. Comput. Educ. 241, 105472 (2026)
[9] He, L., Xu, J.: Automated classification of tutors’ dialogue acts using generative ai: A case study using the cima corpus. arXiv preprint arXiv:2509.09125 (2025)
[10] Hennessy, S., Rojas-Drummond, S., Higham, R., Márquez, A.M., Maine, F., Ríos, R.M., García-Carrión, R., Torreblanca, O., Barrera, M.J.: Developing a coding scheme for analysing classroom dialogue across educational contexts. Learn. Cult. Soc. Interact. 9, 16–44 (2016)
[11] Howe, C., Hennessy, S., Mercer, N., Vrikki, M., Wheatley, L.: Teacher–student dialogue during classroom teaching: Does it really impact on student outcomes? J. Learn. Sci. 28(4-5), 462–512 (2019)
[12] Jiang, Y., Hao, J., Cui, W., Kerzabi, E., Kyllonen, P.: Uncovering transferable collaboration patterns across tasks using large language models. In: AIED’25. pp. 320–335. Springer (2025)
[13] Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., et al.: Chatgpt for good? on opportunities and challenges of large language models for education. Learn. Individ. Differ. 103, 102274 (2023)
[14] Lin, L., Wang, L., Guo, J., Wong, K.F.: Investigating bias in LLM-based bias detection: Disparities between LLMs and human perception. In: Rambow, O., Wanner, L., Apidianaki, M., Al-Khalifa, H., Eugenio, B.D., Schockaert, S. (eds.) Proc. Int. Conf. Comput. Linguist. pp. 10634–10649. ACL, Abu Dhabi, UAE (Jan 2025)
[15] Liu, Z., Xing, W., Ngo, B., Jiao, X., Jiang, S., Li, C.: Engagement patterns of middle school students with ai teachable agents in mathematics learning. Sci. Rep. 15(1), 40971 (2025)
[16] Long, Y., Luo, H., Zhang, Y.: Evaluating large language models in analysing classroom dialogue. npj Sci. Learn. 9(1), 60 (2024)
[17] Miller, P., Dicerbo, K.: Llm based math tutoring: Challenges and dataset. EdArXiv Preprints (2024), https://osf.io/preprints/edarxiv/5zwv3_v1
[18] Muhonen, H., Pakarinen, E., Poikkeus, A.M., Lerkkanen, M.K., Rasku-Puttonen, H.: Quality of educational dialogue and association with students’ academic performance. Learn. Instr. 55, 67–79 (2018)
[19] Muhonen, H., Rasku-Puttonen, H., Pakarinen, E., Poikkeus, A.M., Lerkkanen, M.K.: Knowledge-building patterns in educational dialogue. Int. J. Educ. Res. 81, 25–37 (2017)
[20] Nguyen, H., Hayward, J.: Applying generative artificial intelligence to critiquing science assessments. J. Sci. Educ. Technol. 34(1), 199–214 (2025)
[21] Nguyen, H., Nguyen, A.: Reflective practices and self-regulated learning in designing with generative artificial intelligence: An ordered network analysis. J. Sci. Educ. Technol. 34(5), 1178–1192 (2025)
[22] Qian, K., Liu, S., Li, T., Raković, M., Li, X., Guan, R., Molenaar, I., Nawaz, S., Swiecki, Z., Yan, L., et al.: Towards reliable generative ai-driven scaffolding: Reducing hallucinations and enhancing quality in self-regulated learning support. Comput. Educ. p. 105448 (2025)
[23] Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., Yao, S.: Reflexion: Language agents with verbal reinforcement learning. NeurIPS’23 36, 8634–8652 (2023)
[24] Zhang, Z., Zhang-Li, D., Yu, J., Gong, L., Zhou, J., Hao, Z., Jiang, J., Cao, J., Liu, H., Liu, Z., Hou, L., Li, J.: Simulating classroom education with LLM-empowered agents. In: Chiruzzo, L., Ritter, A., Wang, L. (eds.) NAACL’25. pp. 10364–10379. ACL, Albuquerque, New Mexico (Apr 2025)
[25] Zheng, L., Chiang, W.L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al.: Judging llm-as-a-judge with mt-bench and chatbot arena. NeurIPS’23 36, 46595–46623 (2023)
[26] Zimmerman, B.J.: Becoming a self-regulated learner: An overview. Theory Pract. 41(2), 64–70 (2002)
