Characterizing Students' LLM Usage Behaviors and Their Association with Learning in Critical Thinking Tasks
Pith reviewed 2026-05-08 16:55 UTC · model grok-4.3
The pith
Refined categorization of how students use LLMs in paper critique homework links usage types and initiative levels to midterm performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish a refined bottom-up categorization of LLM usage types drawn from student reports, with each type labeled by the extent of student initiative involved in the interaction. They further examine associations between these categories, overall usage frequency, and performance on midterms that measure critical thinking, using data from two course offerings to extend prior single-offering findings.
What carries the argument
The bottom-up categorization of LLM usage types cross-labeled by the degree of student initiative required in each interaction.
If this is right
- Usage types that preserve higher student initiative may align with stronger midterm performance in critical thinking tasks.
- Frequency of LLM use by itself may show weaker connections to learning outcomes than the specific type of use.
- Patterns identified across two course offerings indicate repeatable behaviors that can be anticipated in similar settings.
- The initiative dimension offers a practical lever for distinguishing supportive from less supportive LLM practices.
Where Pith is reading between the lines
- Course designers could test prompts or guidelines that steer students toward higher-initiative LLM uses to support critical thinking practice.
- Validation of self-reports against tool logs would strengthen the evidence base for these associations.
- Applying the same initiative-labeled categorization to other subjects like problem-solving could reveal whether the pattern holds beyond paper critique.
Load-bearing premise
That students' self-reported LLM usage practices accurately reflect their actual behaviors on the homework assignments, and that performance on the three midterms validly measures learning gains in critical thinking from those activities.
What would settle it
Direct logging or observation of students' actual LLM interactions during the assignments to check agreement with self-reports, or use of a pre-post assessment specifically targeting paper critique skills instead of relying on midterm scores.
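One way to make "agreement with self-reports" concrete is a chance-corrected agreement statistic such as Cohen's kappa, computed between the usage category a student reported and the category a coder assigns from logged interactions. The sketch below is illustrative only: the paper does not describe this analysis, and the function name, the label values, and the example data are all hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels.

    labels_a: e.g. self-reported usage category per assignment (hypothetical)
    labels_b: e.g. category coded from interaction logs (hypothetical)
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")

    n = len(labels_a)
    # Observed agreement: share of items where both sources give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement under independence, from each source's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both sources use a single identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels for four assignments, not data from the study.
self_reported = ["high", "high", "low", "low"]
from_logs = ["high", "low", "low", "low"]
kappa = cohens_kappa(self_reported, from_logs)
```

A kappa near 1 would suggest the self-reports track logged behavior closely, while a value near 0 would suggest agreement no better than chance. Where to draw the line for "acceptable" agreement is a judgment call that this sketch does not settle.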
Original abstract
Large language models (LLMs) are becoming increasingly embedded in students' learning practices, yet much of what is known about how students use LLMs and how this usage impacts learning comes from problem-solving domains or constrained experimental settings. We present an analysis of data on LLM usage collected during two offerings of a research-oriented course where students learn to read, reason about, and critique academic papers. Without restrictions on whether or how to use LLMs, students reported their LLM usage practices when asked to do these activities as a series of homework assignments during the course. This paper extends prior work done on data from a single offering of the same course by presenting a refined bottom-up categorization of LLM usage types, cross-labeled by the extent of student initiative these usages entail. Furthermore, we examine how LLM use impacts student learning, measured by performance on three midterms, looking at factors such as frequency and type of usage.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes self-reported LLM usage data collected from students in two offerings of a research-oriented course on reading, reasoning about, and critiquing academic papers. It refines a bottom-up taxonomy of LLM usage types cross-labeled by the level of student initiative, and examines associations between usage frequency, type, and performance on three midterms as a proxy for learning gains in critical thinking skills.
Significance. If the self-report data and midterm associations prove reliable, the work offers a practically useful taxonomy for understanding LLM integration in open-ended critical thinking tasks and extends single-offering findings to multiple course instances. This could inform pedagogical guidelines in HCI and education research on AI-assisted learning.
major comments (2)
- [Methods] Methods section on data collection: The central taxonomy and associations rest on retrospective self-reports of LLM usage and initiative without apparent triangulation against platform logs, browser histories, or think-aloud data; this directly weakens the claim that the refined categorization accurately reflects behaviors during the homework assignments.
- [Results/Analysis] Section describing learning outcomes and analysis: Midterm performance is used to measure learning gains from the LLM-supported homework without reported controls for prior knowledge, general academic ability, pre/post design, or motivation; this makes it difficult to isolate the impact of usage frequency and type on critical-thinking skill development.
minor comments (1)
- [Abstract] The abstract would benefit from including sample sizes, key statistical methods, and a brief summary of main findings to allow readers to assess the strength of the reported associations.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's comments. We address each major comment below and outline the revisions we will make to the manuscript.
- Referee: [Methods] Methods section on data collection: The central taxonomy and associations rest on retrospective self-reports of LLM usage and initiative without apparent triangulation against platform logs, browser histories, or think-aloud data; this directly weakens the claim that the refined categorization accurately reflects behaviors during the homework assignments.
  Authors: We recognize the limitation of relying solely on retrospective self-reports without additional triangulation methods such as platform logs or think-aloud protocols. Our taxonomy was developed bottom-up from the self-reported data, and the initiative labels were assigned based on students' descriptions of their usage. To address this, we will revise the Methods and Limitations sections to more explicitly acknowledge the potential discrepancies between reported and actual behaviors, and we will moderate our language regarding how accurately the categorization reflects in-situ behaviors. We will also suggest in the discussion that future studies could incorporate logging for validation.
  Revision: partial
- Referee: [Results/Analysis] Section describing learning outcomes and analysis: Midterm performance is used to measure learning gains from the LLM-supported homework without reported controls for prior knowledge, general academic ability, pre/post design, or motivation; this makes it difficult to isolate the impact of usage frequency and type on critical-thinking skill development.
  Authors: We agree that our study design does not allow for causal inferences about the impact of LLM usage on learning gains. The analysis focuses on associations between usage patterns and midterm performance as a proxy for critical thinking skills. We did not collect pre-course measures of prior knowledge or motivation, nor did we use a pre/post design. We will revise the Results, Analysis, and Discussion sections to clearly frame our findings as correlational associations rather than evidence of learning gains attributable to specific usage types. We will also expand the limitations section to highlight these design constraints and the need for controlled studies in future work.
  Revision: yes
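The referee's point about missing controls can be illustrated with a first-order partial correlation, which asks how usage frequency relates to midterm score once a third variable, such as a prior-ability measure, is held fixed. This is a minimal sketch under assumptions: the authors state they collected no such pre-course measure, so the `prior_ability` variable, the function names, and all numbers below are hypothetical and not drawn from the study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    denom = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom == 0:
        raise ValueError("z perfectly predicts x or y; partial correlation undefined")
    return (r_xy - r_xz * r_yz) / denom


# Made-up values for five students, for illustration only.
usage_frequency = [0, 2, 3, 5, 6]        # hypothetical: assignments where an LLM was used
midterm_score = [71, 78, 74, 85, 88]     # hypothetical: average midterm score
prior_ability = [3.0, 3.4, 3.1, 3.8, 3.9]  # hypothetical: pre-course measure

raw = pearson(usage_frequency, midterm_score)
adjusted = partial_correlation(usage_frequency, midterm_score, prior_ability)
```

If `adjusted` is much smaller in magnitude than `raw`, the raw association may largely reflect the control variable. Even an adjusted estimate remains correlational, which is consistent with the authors' planned reframing.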
Circularity Check
No circularity: purely observational empirical analysis
The paper describes an observational study collecting self-reported LLM usage from two course offerings, applying bottom-up categorization to the reports, and correlating usage frequency/types with existing midterm grades as a learning measure. No equations, derivations, fitted parameters, or predictive models are referenced that could reduce to inputs by construction. The abstract and structure indicate standard empirical data analysis without self-definitional loops, fitted-input predictions, or load-bearing self-citations that would force the central claims. The derivation chain is self-contained as data collection and association testing.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Self-reported LLM usage practices accurately reflect students' actual behaviors during homework assignments.
- domain assumption: Midterm performance validly measures learning in critical thinking skills developed through the homework activities.