Characterizing Students' LLM Usage Behaviors and Their Association with Learning in Critical Thinking Tasks

Cristina Conati; Ivan Orozco Vasquez; Minju Park

arxiv: 2605.04534 · v2 · pith:3V4ICJH2new · submitted 2026-05-06 · 💻 cs.HC

Minju Park, Ivan Orozco Vasquez, Cristina Conati

Pith reviewed 2026-05-08 16:55 UTC · model grok-4.3

classification 💻 cs.HC

keywords: LLM usage, critical thinking, student behaviors, learning outcomes, AI in education, paper critique tasks, usage categorization

The pith

Refined categorization of how students use LLMs in paper critique homework links usage types and initiative levels to midterm performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes self-reported LLM usage from students completing homework that involves reading, reasoning about, and critiquing academic papers across two offerings of a research-oriented course. It develops a bottom-up classification of usage behaviors grouped by the amount of student initiative each type requires, extending an earlier analysis from one course. The study then tests whether frequency of use and specific usage categories relate to scores on three midterms that evaluate critical thinking skills. A sympathetic reader would care because the work addresses real, unrestricted student practices in a domain where LLMs are already common. If higher-initiative usages support better outcomes while lower-initiative ones do not, the distinction supplies a concrete basis for shaping how students interact with these tools.

Core claim

The authors establish a refined bottom-up categorization of LLM usage types drawn from student reports, with each type labeled by the extent of student initiative involved in the interaction. They further examine associations between these categories, overall usage frequency, and performance on midterms that measure critical thinking, using data from two course offerings to extend prior single-offering findings.

What carries the argument

The bottom-up categorization of LLM usage types cross-labeled by the degree of student initiative required in each interaction.
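
As a rough illustration of what "cross-labeled" could look like as data, the sketch below maps each usage category to an initiative level and tallies reports along both dimensions. The category names (`category_A`, and so on), the two initiative levels, and the sample reports are hypothetical placeholders, not the paper's taxonomy or data.

```python
from collections import Counter

# Hypothetical mapping: each usage category carries one initiative label.
# These names are placeholders, not categories taken from the paper.
INITIATIVE_BY_CATEGORY = {
    "category_A": "higher",
    "category_B": "lower",
    "category_C": "higher",
}

# Invented self-reports: (student id, assignment number, usage category).
reports = [
    ("s1", 1, "category_A"),
    ("s1", 2, "category_B"),
    ("s2", 1, "category_A"),
    ("s3", 1, "category_B"),
    ("s3", 2, "category_C"),
]

# Tally reports per category, then roll them up by initiative level.
counts_by_category = Counter(category for _, _, category in reports)
counts_by_initiative = Counter(
    INITIATIVE_BY_CATEGORY[category] for _, _, category in reports
)

for category, count in sorted(counts_by_category.items()):
    print(f"{category:<12} {INITIATIVE_BY_CATEGORY[category]:<7} {count}")
print(dict(counts_by_initiative))
```

Keeping the initiative label on the category rather than on each report means a relabeled category updates every tally that depends on it.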

If this is right

Usage types that preserve higher student initiative may align with stronger midterm performance in critical thinking tasks.
Frequency of LLM use by itself may show weaker connections to learning outcomes than the specific type of use.
Patterns identified across two course offerings indicate repeatable behaviors that can be anticipated in similar settings.
The initiative dimension offers a practical lever for distinguishing supportive from less supportive LLM practices.
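
One way such a group comparison could be checked is a two-sided permutation test on the difference in mean midterm scores. The sketch below is an assumption-laden illustration: the summary does not state which statistical test the paper uses, the function name `permutation_test` is hypothetical, and the scores are invented. Any difference it detects is an association, not evidence of a causal effect.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns (observed difference, estimated p-value).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Reassign scores to groups at random, keeping group sizes fixed.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            at_least_as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Invented midterm scores, for illustration only.
higher_initiative = [82, 75, 90, 68, 88, 79]
lower_initiative = [70, 65, 80, 72, 60, 74]

observed, p_value = permutation_test(higher_initiative, lower_initiative)
print(f"difference in means: {observed:.2f}, p = {p_value:.3f}")
```

A permutation test is used here only because it makes few distributional assumptions and suits small classes; it does not address confounds such as prior ability.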

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Course designers could test prompts or guidelines that steer students toward higher-initiative LLM uses to support critical thinking practice.
Validation of self-reports against tool logs would strengthen the evidence base for these associations.
Applying the same initiative-labeled categorization to other subjects like problem-solving could reveal whether the pattern holds beyond paper critique.

Load-bearing premise

That students' self-reported LLM usage practices accurately reflect their actual behaviors on the homework assignments, and that performance on the three midterms validly measures learning gains in critical thinking from those activities.

What would settle it

Direct logging or observation of students' actual LLM interactions during the assignments to check agreement with self-reports, or use of a pre-post assessment specifically targeting paper critique skills instead of relying on midterm scores.
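
If logs were available, agreement between self-reports and logged use could be summarized with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The following is a minimal sketch under stated assumptions: the paper reports no such logs, the helper `cohens_kappa` is written here for illustration, and the per-assignment labels are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Observed proportion of items on which the two sources agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each source's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    all_labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in all_labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented data: did the student use an LLM on each of eight assignments?
self_reported = [True, True, False, False, True, False, True, False]
logged = [True, True, True, False, True, False, False, False]

kappa = cohens_kappa(self_reported, logged)
print(f"{kappa:.2f}")  # prints 0.50
```

In this toy data the two sources agree on 6 of 8 assignments (0.75), chance agreement is 0.5, and kappa is therefore 0.5.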

Figures

Figures reproduced from arXiv: 2605.04534 by Cristina Conati, Ivan Orozco Vasquez, Minju Park.

**Figure 1.** LLM usage rates for discussion points and critical…

**Figure 2.** Temporal visualization of reported LLM use across assignments.

**Figure 3.** Counts for each LLM usage category for discussion…

**Figure 6.** Comparisons of midterm exam scores between stu…

**Figure 5.** Counts for each LLM usage category, with stacked…

**Figure 8.** Comparisons of midterm exam scores between…

**Figure 7.** Comparisons of midterm exam scores between stu…

Abstract

Large language models (LLMs) are becoming increasingly embedded in students' learning practices, yet much of what is known about how students use LLMs and how this usage impacts learning comes from problem-solving domains or constrained experimental settings. We present an analysis of data on LLM usage collected during two offerings of a research-oriented course where students learn to read, reason about, and critique academic papers. Without restrictions on whether or how to use LLMs, students reported their LLM usage practices when asked to do these activities as a series of homework assignments during the course. This paper extends prior work done on data from a single offering of the same course by presenting a refined bottom-up categorization of LLM usage types, cross-labeled by the extent of student initiative these usages entail. Furthermore, we examine how LLM use impacts student learning, measured by performance on three midterms, looking at factors such as frequency and type of usage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Refined taxonomy of LLM usage in critical thinking homework from two course offerings, with links to midterm performance, but self-reports and outcome measures need validation.

The main things to know are that this paper extends their earlier single-offering study by pulling in data from a second semester of the same research-oriented course on reading and critiquing papers. They come up with a refined set of LLM usage categories, also labeled by how much initiative the student took, and then check whether frequency or type of use relates to how students did on the three midterms.

They do a decent job of describing real student practices in an open setting without forcing or banning LLM use. The bottom-up categorization and the initiative dimension give a clearer picture than just counting how often students used the tools.

The soft spots are the reliance on self-reported usage after the assignments, which might not line up with what actually happened, and the use of midterm scores as a stand-in for learning gains from those specific tasks. Without any cross-checks like logs or pre/post measures, and with no controls for things like overall student ability, it's tough to draw strong conclusions about impact on learning.

This is for people working on LLM integration in education, especially in HCI or ed tech. Anyone thinking about how to handle these tools in classes that emphasize critical analysis will get some concrete examples from it. It deserves peer review because the setting is authentic and the questions are relevant right now, though the authors will probably need to address the measurement issues in revisions.

Referee Report

2 major / 1 minor

Summary. The manuscript analyzes self-reported LLM usage data collected from students in two offerings of a research-oriented course on reading, reasoning about, and critiquing academic papers. It refines a bottom-up taxonomy of LLM usage types cross-labeled by the level of student initiative, and examines associations between usage frequency, type, and performance on three midterms as a proxy for learning gains in critical thinking skills.

Significance. If the self-report data and midterm associations prove reliable, the work offers a practically useful taxonomy for understanding LLM integration in open-ended critical thinking tasks and extends single-offering findings to multiple course instances. This could inform pedagogical guidelines in HCI and education research on AI-assisted learning.

major comments (2)

[Methods] Methods section on data collection: The central taxonomy and associations rest on retrospective self-reports of LLM usage and initiative without apparent triangulation against platform logs, browser histories, or think-aloud data; this directly weakens the claim that the refined categorization accurately reflects behaviors during the homework assignments.
[Results/Analysis] Section describing learning outcomes and analysis: Midterm performance is used to measure learning gains from the LLM-supported homework without reported controls for prior knowledge, general academic ability, pre/post design, or motivation; this makes it difficult to isolate the impact of usage frequency and type on critical-thinking skill development.

minor comments (1)

[Abstract] The abstract would benefit from including sample sizes, key statistical methods, and a brief summary of main findings to allow readers to assess the strength of the reported associations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We address each major comment below and outline the revisions we will make to the manuscript.

Referee: [Methods] Methods section on data collection: The central taxonomy and associations rest on retrospective self-reports of LLM usage and initiative without apparent triangulation against platform logs, browser histories, or think-aloud data; this directly weakens the claim that the refined categorization accurately reflects behaviors during the homework assignments.

Authors: We recognize the limitation of relying solely on retrospective self-reports without additional triangulation methods such as platform logs or think-aloud protocols. Our taxonomy was developed bottom-up from the self-reported data, and the initiative labels were assigned based on students' descriptions of their usage. To address this, we will revise the Methods and Limitations sections to more explicitly acknowledge the potential discrepancies between reported and actual behaviors, and we will moderate our language regarding how accurately the categorization reflects in-situ behaviors. We will also suggest in the discussion that future studies could incorporate logging for validation.
revision: partial

Referee: [Results/Analysis] Section describing learning outcomes and analysis: Midterm performance is used to measure learning gains from the LLM-supported homework without reported controls for prior knowledge, general academic ability, pre/post design, or motivation; this makes it difficult to isolate the impact of usage frequency and type on critical-thinking skill development.

Authors: We agree that our study design does not allow for causal inferences about the impact of LLM usage on learning gains. The analysis focuses on associations between usage patterns and midterm performance as a proxy for critical thinking skills. We did not collect pre-course measures of prior knowledge or motivation, nor did we use a pre/post design. We will revise the Results, Analysis, and Discussion sections to clearly frame our findings as correlational associations rather than evidence of learning gains attributable to specific usage types. We will also expand the limitations section to highlight these design constraints and the need for controlled studies in future work.
revision: yes

Circularity Check

0 steps flagged

No circularity: purely observational empirical analysis

The paper describes an observational study collecting self-reported LLM usage from two course offerings, applying bottom-up categorization to the reports, and correlating usage frequency/types with existing midterm grades as a learning measure. No equations, derivations, fitted parameters, or predictive models are referenced that could reduce to inputs by construction. The abstract and structure indicate standard empirical data analysis without self-definitional loops, fitted-input predictions, or load-bearing self-citations that would force the central claims. The derivation chain is self-contained as data collection and association testing.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of self-reported usage data and the assumption that midterm performance captures the targeted learning outcomes; no free parameters or invented entities are introduced.

axioms (2)

domain assumption: Self-reported LLM usage practices accurately reflect students' actual behaviors during homework assignments.
The study relies entirely on students reporting their practices when asked, with no independent verification such as logs or observations.

domain assumption: Midterm performance validly measures learning in critical thinking skills developed through the homework activities.
The paper uses midterm scores as the outcome variable without discussing alternative measures or potential confounds.

pith-pipeline@v0.9.0 · 5459 in / 1413 out tokens · 96949 ms · 2026-05-08T16:55:24.187089+00:00 · methodology
