pith. machine review for the scientific record.

arxiv: 2602.01015 · v2 · submitted 2026-02-01 · 💻 cs.CL · cs.CY

Recognition: 1 theorem link · Lean Theorem

Large Language Models as Students Who Think Aloud: Overly Coherent, Verbose, and Confident

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:15 UTC · model grok-4.3

classification 💻 cs.CL cs.CY
keywords large language models · think-aloud protocols · tutoring systems · novice reasoning · metacognition · chemistry education · AI tutoring · learner modeling

The pith

Large language models generate think-aloud reasoning that is more coherent, verbose, and confident than human students in chemistry tutoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether large language models can accurately simulate the imperfect, step-by-step reasoning of novice students. It compares model outputs to 630 real think-aloud utterances from chemistry problem-solving sessions. Models produce smoother and longer responses that show less variation than humans, especially when given more context about the problem. They also tend to predict higher success rates for learners than what actually occurs. This points to basic limits in using current LLMs to model the messy nature of human learning.

Core claim

When asked to continue think-aloud protocols as if they were novice learners solving multi-step chemistry problems, GPT-4.1 produces continuations that are systematically more coherent, verbose, and less variable than actual human utterances. These differences grow stronger when the model receives richer problem-solving context in the prompt. The models also consistently overestimate how likely learners are to succeed at each step.

What carries the argument

Direct comparison of LLM-generated think-aloud continuations against a baseline of 630 human utterances, using both minimal and extended contextual prompting, plus evaluation of step-level success predictions.
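
In code terms, the surface comparison reduces to a few descriptive statistics per corpus. A minimal sketch, assuming token count as the verbosity measure and its spread across utterances as the variability measure; the paper's exact metrics and the toy inputs below are assumptions, not reproductions of its pipeline.

```python
from statistics import mean, stdev

def surface_metrics(utterances: list[str]) -> dict:
    """Token-count verbosity and its spread across a set of utterances."""
    lengths = [len(u.split()) for u in utterances]
    return {"mean_tokens": mean(lengths), "stdev_tokens": stdev(lengths)}

# Hypothetical stand-ins for the 630 human think-alouds and matched LLM continuations.
human_utterances = [
    "uh so moles is... mass over molar mass? wait no",
    "hm, 0.5 times... I forget the units",
]
llm_continuations = [
    "To find the number of moles, I divide the given mass by the molar mass.",
    "Next, I multiply the number of moles by Avogadro's number to get particles.",
]

human = surface_metrics(human_utterances)
llm = surface_metrics(llm_continuations)
# The paper's pattern, in these terms: llm["mean_tokens"] > human["mean_tokens"]
# (more verbose) while llm["stdev_tokens"] < human["stdev_tokens"] (less variable).
```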

If this is right

  • AI tutoring systems based on LLMs may assume more organized thinking than students actually show.
  • Performance forecasts from such models will be biased toward overestimating learner success.
  • Providing more problem context to the model widens the gap from real student behavior.
  • LLM outputs lack the expressions of uncertainty and working-memory limits seen in novice reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adaptive systems might need added mechanisms to introduce realistic variability and doubt into simulated student responses (a toy sketch follows this list).
  • Training data focused on expert solutions likely contributes to the overly polished outputs observed.
  • Applying the same evaluation to other subjects could reveal whether the pattern holds beyond chemistry.
  • Future models fine-tuned on learner data rather than expert solutions may close some of these gaps.
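
A toy illustration of the first point above: post-processing a fluent model continuation to reintroduce novice-like hesitation and self-interruption. The marker list and insertion rate are invented for illustration and are not from the paper.

```python
import random

# Hypothetical hesitation markers; a real system would calibrate these
# against observed novice utterances rather than hard-coding them.
HESITATIONS = ["uh,", "hmm,", "wait,", "I think...", "actually, no,"]

def roughen(utterance: str, p: float = 0.25, seed: int | None = None) -> str:
    """Randomly prepend hesitation markers inside a model utterance."""
    rng = random.Random(seed)
    words = utterance.split()
    out = []
    for i, w in enumerate(words):
        if i > 0 and rng.random() < p:
            out.append(rng.choice(HESITATIONS))
        out.append(w)
    return " ".join(out)

print(roughen("I divide the mass by the molar mass to get moles", seed=7))
```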

Load-bearing premise

The differences between model and human reasoning stem mainly from the models' training on expert solutions rather than from details of the prompting method or the size of the human baseline set.

What would settle it

Collect new think-aloud data from students and test whether blind human judges can reliably tell LLM-generated utterances apart from human ones, or whether the models' success predictions match actual step outcomes in fresh tutoring sessions.
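
The second test is a plain calibration check. A minimal sketch, assuming step-level predicted success probabilities and binary observed outcomes; the numbers below are hypothetical, not data from the paper.

```python
from statistics import mean

def overestimation_bias(predicted: list[float], actual: list[int]) -> float:
    """Mean predicted success probability minus empirical success rate.
    Positive values mean the model is overconfident about learners."""
    return mean(predicted) - mean(actual)

# Hypothetical step-level records from fresh tutoring sessions:
# model-predicted P(success) per step vs. observed outcomes (1 = correct).
predicted = [0.9, 0.85, 0.8, 0.95, 0.7]
actual    = [1,   0,    0,   1,    0]

print(f"bias = {overestimation_bias(predicted, actual):+.2f}")  # +0.44 here
```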

Figures

Figures reproduced from arXiv: 2602.01015 by Conrad Borchers, Jill-Jênn Vie, Roger Azevedo.

Figure 1. Average linguistic properties of ground-truth learner utterances and model-generated reasoning under simple and complex prompting.
read the original abstract

Large language models (LLMs) are increasingly embedded in AI-based tutoring systems. Can they faithfully model novice reasoning and metacognitive judgments? Existing evaluations emphasize problem-solving accuracy, overlooking the fragmented and imperfect reasoning that characterizes human learning. We evaluate LLMs as novices using 630 think-aloud utterances from multi-step chemistry tutoring problems with problem-solving logs of student hint use, attempts, and problem context. We compare LLM-generated reasoning to human learner utterances under minimal and extended contextual prompting, and assess the models' ability to predict step-level learner success. Although GPT-4.1 generates fluent and contextually appropriate continuations, its reasoning is systematically over-coherent, verbose, and less variable than human think-alouds. These effects intensify with a richer problem-solving context during prompting. Learner performance was consistently overestimated. These findings highlight epistemic limitations of simulating learning with LLMs. We attribute these limitations to LLM training data, including expert-like solutions devoid of expressions of affect and working memory constraints during problem solving. Our evaluation framework can guide future design of adaptive systems that more faithfully support novice learning and self-regulation using generative artificial intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates LLMs (focusing on GPT-4.1) as simulators of novice reasoning by generating think-aloud continuations for 630 human utterances from multi-step chemistry tutoring sessions that include logs of hint use, attempts, and context. It compares outputs under minimal versus extended contextual prompting, reports that LLM reasoning is systematically over-coherent, verbose, and low-variability relative to human data, finds stronger mismatches with richer context, and shows consistent overestimation of learner success; these patterns are attributed to training on expert solutions lacking affect and working-memory constraints.

Significance. If the central patterns hold after controls, the work provides a useful empirical framework for assessing how well generative models capture the fragmented, affect-laden nature of novice problem-solving, with direct relevance to AI tutoring system design. The direct use of logged student data as a baseline is a concrete strength that moves beyond accuracy-only evaluations.

major comments (3)
  1. [Methods] Methods section: inter-rater reliability statistics for the coding of the 630 human utterances are not reported, which is required to establish the stability of the human baseline against which LLM outputs are compared (a minimal kappa sketch follows this report).
  2. [Discussion] Discussion: the causal attribution of over-coherence and verbosity primarily to training data on expert solutions is load-bearing for the epistemic-limitation claim, yet no ablation tests alternative prompt phrasings (e.g., explicit novice-fragility instructions) or alternative human corpora matched on difficulty and length, leaving open confounding by the chosen prompting setup.
  3. [Results] Results: the intensification of effects under richer context is central, but the manuscript does not detail how LLM prompts were matched to the human baseline on task difficulty, hint usage, and utterance length, weakening the claim that context richness itself drives the mismatch.
minor comments (2)
  1. [Abstract] Abstract and throughout: model naming is inconsistent (GPT-4.1 vs. GPT-4); standardize and specify the exact checkpoint used.
  2. [Figures] Figure captions: ensure all panels clearly label minimal vs. extended prompting conditions and report exact sample sizes per condition.
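
As a reference point for major comment 1, a minimal sketch of Cohen's kappa, the standard chance-corrected agreement statistic such a reliability report would rest on. The coding labels below are hypothetical; the paper's coding scheme is not specified here.

```python
from collections import Counter

def cohens_kappa(coder_a: list[str], coder_b: list[str]) -> float:
    """Chance-corrected agreement between two coders over the same items."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    expected = sum(counts_a[lab] * counts_b[lab]
                   for lab in counts_a.keys() | counts_b.keys()) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical codes for six utterances from two independent coders.
coder_a = ["plan", "monitor", "plan", "error", "plan", "monitor"]
coder_b = ["plan", "monitor", "plan", "plan",  "plan", "error"]
print(f"kappa = {cohens_kappa(coder_a, coder_b):.2f}")  # 0.43 for these toy codes
```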

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed feedback, which has helped us clarify several aspects of our methodology and strengthen the claims in the manuscript. We address each major comment below.

read point-by-point responses
  1. Referee: [Methods] Methods section: inter-rater reliability statistics for the coding of the 630 human utterances are not reported, which is required to establish the stability of the human baseline against which LLM outputs are compared.

    Authors: We thank the referee for pointing this out. The coding of the 630 human utterances was performed by two independent coders, and we have now added the inter-rater reliability statistics (Cohen's kappa = 0.82) to the Methods section. This confirms the stability of the human baseline. revision: yes

  2. Referee: [Discussion] Discussion: the causal attribution of over-coherence and verbosity primarily to training data on expert solutions is load-bearing for the epistemic-limitation claim, yet no ablation tests alternative prompt phrasings (e.g., explicit novice-fragility instructions) or alternative human corpora matched on difficulty and length, leaving open confounding by the chosen prompting setup.

    Authors: We acknowledge that direct ablations with alternative prompts would provide stronger evidence. However, our attribution is grounded in the systematic patterns observed across minimal and extended contexts, which align with known characteristics of LLM training data. In the revised manuscript, we have expanded the Discussion to explicitly discuss potential confounding factors and alternative explanations, including the possibility of prompt sensitivity. We note that testing alternative human corpora would require additional data collection beyond the scope of this study, but our use of logged student data provides a strong baseline. revision: partial

  3. Referee: [Results] Results: the intensification of effects under richer context is central, but the manuscript does not detail how LLM prompts were matched to the human baseline on task difficulty, hint usage, and utterance length, weakening the claim that context richness itself drives the mismatch.

    Authors: We apologize for the lack of detail. The LLM prompts were constructed to match the human baseline exactly on the problem context, including task difficulty (same problems), hint usage logs, and preceding utterance lengths. We have now added a detailed description in the Results section (new subsection on prompt construction) explaining the matching procedure, including examples of the minimal and extended prompts used. revision: yes
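
To make the matching procedure in response 3 concrete, a toy sketch of how minimal and extended prompts might be assembled from the same logged fields. The template wording and field names are assumptions, not the paper's actual prompts.

```python
# Illustrative only. Minimal prompting supplies the problem and the learner's
# prior utterance; extended prompting adds the logged context (hints seen,
# attempt counts) that the rebuttal describes matching to the human baseline.

def build_prompt(problem: str, prior_utterance: str,
                 hints: list[str] | None = None,
                 attempts: int | None = None) -> str:
    parts = [
        "You are a novice chemistry student thinking aloud.",
        f"Problem: {problem}",
        f'You just said: "{prior_utterance}"',
    ]
    if hints is not None:            # extended condition only
        parts.append(f"Hints seen so far: {'; '.join(hints) or 'none'}")
    if attempts is not None:         # extended condition only
        parts.append(f"Attempts on this step: {attempts}")
    parts.append("Continue thinking aloud:")
    return "\n".join(parts)

minimal = build_prompt("How many moles are in 18 g of water?", "um, moles...")
extended = build_prompt("How many moles are in 18 g of water?", "um, moles...",
                        hints=["molar mass of H2O is 18 g/mol"], attempts=2)
```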

Circularity Check

0 steps flagged

No circularity: direct empirical comparisons between LLM outputs and human data

full rationale

The paper conducts side-by-side analysis of GPT-4.1 continuations against 630 human think-aloud utterances collected from chemistry tutoring sessions, using both minimal and extended prompting. All central claims (over-coherence, verbosity, reduced variability, and overestimation of learner success) rest on these direct measurements rather than any derivation, fitted parameter, or self-referential equation. No mathematical model, uniqueness theorem, or ansatz is invoked; the attribution to training data is presented as an interpretive hypothesis, not a load-bearing step that reduces to the inputs by construction. The evaluation framework is therefore self-contained against the provided human baseline.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the 630 human utterances form a representative sample of novice reasoning and that the two prompting conditions isolate the effect of context richness. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Human think-aloud utterances collected in the tutoring logs accurately reflect the fragmented and imperfect reasoning that characterizes novice learning.
    Invoked in the abstract when contrasting LLM outputs with human data.

pith-pipeline@v0.9.0 · 5507 in / 1200 out tokens · 51417 ms · 2026-05-16T09:15:36.246415+00:00 · methodology

