pith. machine review for the scientific record.

arxiv: 2602.01015 · v2 · submitted 2026-02-01 · 💻 cs.CL · cs.CY

Recognition: 1 theorem link · Lean Theorem

Large Language Models as Students Who Think Aloud: Overly Coherent, Verbose, and Confident

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:15 UTC · model grok-4.3

classification 💻 cs.CL cs.CY
keywords large language models · think-aloud protocols · tutoring systems · novice reasoning · metacognition · chemistry education · AI tutoring · learner modeling

The pith

Large language models generate think-aloud reasoning that is more coherent, verbose, and confident than human students in chemistry tutoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether large language models can accurately simulate the imperfect, step-by-step reasoning of novice students. It compares model outputs to 630 real think-aloud utterances from chemistry problem-solving sessions. Models produce smoother and longer responses that show less variation than humans, especially when given more context about the problem. They also tend to predict higher success rates for learners than what actually occurs. This points to basic limits in using current LLMs to model the messy nature of human learning.

Core claim

When asked to continue think-aloud protocols as if they were novice learners solving multi-step chemistry problems, GPT-4.1 produces continuations that are systematically more coherent, verbose, and less variable than actual human utterances. These differences grow stronger when the model receives richer problem-solving context in the prompt. The models also consistently overestimate how likely learners are to succeed at each step.

What carries the argument

Direct comparison of LLM-generated think-aloud continuations against a baseline of 630 human utterances, using both minimal and extended contextual prompting, plus evaluation of step-level success predictions.
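
In code terms, the surface comparison reduces to a few descriptive statistics per corpus. A minimal sketch, assuming token count as the verbosity measure and its spread across utterances as the variability measure; the paper's exact metrics and the toy inputs below are assumptions, not reproductions of its pipeline.

```python
from statistics import mean, stdev

def surface_metrics(utterances: list[str]) -> dict:
    """Token-count verbosity and its spread across a set of utterances."""
    lengths = [len(u.split()) for u in utterances]
    return {"mean_tokens": mean(lengths), "stdev_tokens": stdev(lengths)}

# Hypothetical stand-ins for the 630 human think-alouds and matched LLM continuations.
human_utterances = [
    "uh so moles is... mass over molar mass? wait no",
    "hm, 0.5 times... I forget the units",
]
llm_continuations = [
    "To find the number of moles, I divide the given mass by the molar mass.",
    "Next, I multiply the number of moles by Avogadro's number to get particles.",
]

human = surface_metrics(human_utterances)
llm = surface_metrics(llm_continuations)
# The paper's pattern, in these terms: llm["mean_tokens"] > human["mean_tokens"]
# (more verbose) while llm["stdev_tokens"] < human["stdev_tokens"] (less variable).
```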

If this is right

  • AI tutoring systems based on LLMs may assume more organized thinking than students actually show.
  • Performance forecasts from such models will be biased toward overestimating learner success.
  • Providing more problem context to the model widens the gap from real student behavior.
  • LLM outputs lack the expressions of uncertainty and working-memory limits seen in novice reasoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adaptive systems might need added mechanisms to introduce realistic variability and doubt into simulated student responses (a toy sketch follows this list).
  • Training data focused on expert solutions likely contributes to the overly polished outputs observed.
  • Applying the same evaluation to other subjects could reveal whether the pattern holds beyond chemistry.
  • Future models fine-tuned on learner data rather than expert solutions may close some of these gaps.
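
A toy illustration of the first point above: post-processing a fluent model continuation to reintroduce novice-like hesitation and self-interruption. The marker list and insertion rate are invented for illustration and are not from the paper.

```python
import random

# Hypothetical hesitation markers; a real system would calibrate these
# against observed novice utterances rather than hard-coding them.
HESITATIONS = ["uh,", "hmm,", "wait,", "I think...", "actually, no,"]

def roughen(utterance: str, p: float = 0.25, seed: int | None = None) -> str:
    """Randomly prepend hesitation markers inside a model utterance."""
    rng = random.Random(seed)
    words = utterance.split()
    out = []
    for i, w in enumerate(words):
        if i > 0 and rng.random() < p:
            out.append(rng.choice(HESITATIONS))
        out.append(w)
    return " ".join(out)

print(roughen("I divide the mass by the molar mass to get moles", seed=7))
```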

Load-bearing premise

The differences between model and human reasoning stem mainly from the models' training on expert solutions rather than from details of the prompting method or the size of the human baseline set.

What would settle it

Collect new think-aloud data from students and test whether blind human judges can reliably tell LLM-generated utterances apart from human ones, or whether the models' success predictions match actual step outcomes in fresh tutoring sessions.
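
The second test is a plain calibration check. A minimal sketch, assuming step-level predicted success probabilities and binary observed outcomes; the numbers below are hypothetical, not data from the paper.

```python
from statistics import mean

def overestimation_bias(predicted: list[float], actual: list[int]) -> float:
    """Mean predicted success probability minus empirical success rate.
    Positive values mean the model is overconfident about learners."""
    return mean(predicted) - mean(actual)

# Hypothetical step-level records from fresh tutoring sessions:
# model-predicted P(success) per step vs. observed outcomes (1 = correct).
predicted = [0.9, 0.85, 0.8, 0.95, 0.7]
actual    = [1,   0,    0,   1,    0]

print(f"bias = {overestimation_bias(predicted, actual):+.2f}")  # +0.44 here
```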

Figures

Figures reproduced from arXiv: 2602.01015 by Conrad Borchers, Jill-Jênn Vie, Roger Azevedo.

Figure 1. Average linguistic properties of ground-truth learner utterances and model-generated reasoning under simple and complex prompting.
read the original abstract

Large language models (LLMs) are increasingly embedded in AI-based tutoring systems. Can they faithfully model novice reasoning and metacognitive judgments? Existing evaluations emphasize problem-solving accuracy, overlooking the fragmented and imperfect reasoning that characterizes human learning. We evaluate LLMs as novices using 630 think-aloud utterances from multi-step chemistry tutoring problems with problem-solving logs of student hint use, attempts, and problem context. We compare LLM-generated reasoning to human learner utterances under minimal and extended contextual prompting, and assess the models' ability to predict step-level learner success. Although GPT-4.1 generates fluent and contextually appropriate continuations, its reasoning is systematically over-coherent, verbose, and less variable than human think-alouds. These effects intensify with a richer problem-solving context during prompting. Learner performance was consistently overestimated. These findings highlight epistemic limitations of simulating learning with LLMs. We attribute these limitations to LLM training data, including expert-like solutions devoid of expressions of affect and working memory constraints during problem solving. Our evaluation framework can guide future design of adaptive systems that more faithfully support novice learning and self-regulation using generative artificial intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript evaluates LLMs (focusing on GPT-4.1) as simulators of novice reasoning by generating think-aloud continuations for 630 human utterances from multi-step chemistry tutoring sessions that include logs of hint use, attempts, and context. It compares outputs under minimal versus extended contextual prompting, reports that LLM reasoning is systematically over-coherent, verbose, and low-variability relative to human data, finds stronger mismatches with richer context, and shows consistent overestimation of learner success; these patterns are attributed to training on expert solutions lacking affect and working-memory constraints.

Significance. If the central patterns hold after controls, the work provides a useful empirical framework for assessing how well generative models capture the fragmented, affect-laden nature of novice problem-solving, with direct relevance to AI tutoring system design. The direct use of logged student data as a baseline is a concrete strength that moves beyond accuracy-only evaluations.

major comments (3)
  1. [Methods] Methods section: inter-rater reliability statistics for the coding of the 630 human utterances are not reported, which is required to establish the stability of the human baseline against which LLM outputs are compared (a minimal kappa sketch follows this report).
  2. [Discussion] Discussion: the causal attribution of over-coherence and verbosity primarily to training data on expert solutions is load-bearing for the epistemic-limitation claim, yet no ablation tests alternative prompt phrasings (e.g., explicit novice-fragility instructions) or alternative human corpora matched on difficulty and length, leaving open confounding by the chosen prompting setup.
  3. [Results] Results: the intensification of effects under richer context is central, but the manuscript does not detail how LLM prompts were matched to the human baseline on task difficulty, hint usage, and utterance length, weakening the claim that context richness itself drives the mismatch.
minor comments (2)
  1. [Abstract] Abstract and throughout: model naming is inconsistent (GPT-4.1 vs. GPT-4); standardize and specify the exact checkpoint used.
  2. [Figures] Figure captions: ensure all panels clearly label minimal vs. extended prompting conditions and report exact sample sizes per condition.
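
As a reference point for major comment 1, a minimal sketch of Cohen's kappa, the standard chance-corrected agreement statistic such a reliability report would rest on. The coding labels below are hypothetical; the paper's coding scheme is not specified here.

```python
from collections import Counter

def cohens_kappa(coder_a: list[str], coder_b: list[str]) -> float:
    """Chance-corrected agreement between two coders over the same items."""
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    expected = sum(counts_a[lab] * counts_b[lab]
                   for lab in counts_a.keys() | counts_b.keys()) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical codes for six utterances from two independent coders.
coder_a = ["plan", "monitor", "plan", "error", "plan", "monitor"]
coder_b = ["plan", "monitor", "plan", "plan",  "plan", "error"]
print(f"kappa = {cohens_kappa(coder_a, coder_b):.2f}")  # 0.43 for these toy codes
```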

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's detailed feedback, which has helped us clarify several aspects of our methodology and strengthen the claims in the manuscript. We address each major comment below.

read point-by-point responses
  1. Referee: [Methods] Methods section: inter-rater reliability statistics for the coding of the 630 human utterances are not reported, which is required to establish the stability of the human baseline against which LLM outputs are compared.

    Authors: We thank the referee for pointing this out. The coding of the 630 human utterances was performed by two independent coders, and we have now added the inter-rater reliability statistics (Cohen's kappa = 0.82) to the Methods section. This confirms the stability of the human baseline. revision: yes

  2. Referee: [Discussion] Discussion: the causal attribution of over-coherence and verbosity primarily to training data on expert solutions is load-bearing for the epistemic-limitation claim, yet no ablation tests alternative prompt phrasings (e.g., explicit novice-fragility instructions) or alternative human corpora matched on difficulty and length, leaving open confounding by the chosen prompting setup.

    Authors: We acknowledge that direct ablations with alternative prompts would provide stronger evidence. However, our attribution is grounded in the systematic patterns observed across minimal and extended contexts, which align with known characteristics of LLM training data. In the revised manuscript, we have expanded the Discussion to explicitly discuss potential confounding factors and alternative explanations, including the possibility of prompt sensitivity. We note that testing alternative human corpora would require additional data collection beyond the scope of this study, but our use of logged student data provides a strong baseline. revision: partial

  3. Referee: [Results] Results: the intensification of effects under richer context is central, but the manuscript does not detail how LLM prompts were matched to the human baseline on task difficulty, hint usage, and utterance length, weakening the claim that context richness itself drives the mismatch.

    Authors: We apologize for the lack of detail. The LLM prompts were constructed to match the human baseline exactly on the problem context, including task difficulty (same problems), hint usage logs, and preceding utterance lengths. We have now added a detailed description in the Results section (new subsection on prompt construction) explaining the matching procedure, including examples of the minimal and extended prompts used. revision: yes
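
To make the matching procedure in response 3 concrete, a toy sketch of how minimal and extended prompts might be assembled from the same logged fields. The template wording and field names are assumptions, not the paper's actual prompts.

```python
# Illustrative only. Minimal prompting supplies the problem and the learner's
# prior utterance; extended prompting adds the logged context (hints seen,
# attempt counts) that the rebuttal describes matching to the human baseline.

def build_prompt(problem: str, prior_utterance: str,
                 hints: list[str] | None = None,
                 attempts: int | None = None) -> str:
    parts = [
        "You are a novice chemistry student thinking aloud.",
        f"Problem: {problem}",
        f'You just said: "{prior_utterance}"',
    ]
    if hints is not None:            # extended condition only
        parts.append(f"Hints seen so far: {'; '.join(hints) or 'none'}")
    if attempts is not None:         # extended condition only
        parts.append(f"Attempts on this step: {attempts}")
    parts.append("Continue thinking aloud:")
    return "\n".join(parts)

minimal = build_prompt("How many moles are in 18 g of water?", "um, moles...")
extended = build_prompt("How many moles are in 18 g of water?", "um, moles...",
                        hints=["molar mass of H2O is 18 g/mol"], attempts=2)
```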

Circularity Check

0 steps flagged

No circularity: direct empirical comparisons between LLM outputs and human data

full rationale

The paper conducts side-by-side analysis of GPT-4.1 continuations against 630 human think-aloud utterances collected from chemistry tutoring sessions, using both minimal and extended prompting. All central claims (over-coherence, verbosity, reduced variability, and overestimation of learner success) rest on these direct measurements rather than any derivation, fitted parameter, or self-referential equation. No mathematical model, uniqueness theorem, or ansatz is invoked; the attribution to training data is presented as an interpretive hypothesis, not a load-bearing step that reduces to the inputs by construction. The evaluation framework is therefore self-contained against the provided human baseline.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the 630 human utterances form a representative sample of novice reasoning and that the two prompting conditions isolate the effect of context richness. No free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Human think-aloud utterances collected in the tutoring logs accurately reflect the fragmented and imperfect reasoning that characterizes novice learning.
    Invoked in the abstract when contrasting LLM outputs with human data.

pith-pipeline@v0.9.0 · 5507 in / 1200 out tokens · 51417 ms · 2026-05-16T09:15:36.246415+00:00 · methodology

