Your Students Don't Use LLMs Like You Wish They Did

Angela Sun; Jonathan K. Kummerfeld; Matthew Clemson; Sebastian Kobler

arXiv: 2604.23486 · v1 · submitted 2026-04-26 · cs.CL · cs.CY · cs.HC

Pith reviewed 2026-05-08 06:23 UTC · model grok-4.3

keywords: educational AI, student-AI dialogue, pedagogical alignment, computational metrics, usage patterns, deployment context, answer extraction, conversational tutors

The pith

Students treat AI tutors as answer-extraction tools rather than partners in sustained learning dialogue.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that standard evaluation methods like engagement counts and satisfaction surveys fail to reveal whether educational AI systems actually support teaching goals. It develops six computational metrics that track turn-by-turn alignment between student messages and intended pedagogical patterns. When these metrics are applied to more than twelve thousand messages from five hundred real course conversations, they expose a consistent gap: instructors design the systems for ongoing back-and-forth learning, yet students primarily seek direct solutions to assignment questions. The strongest driver of this behavior is not student preference or the AI's capabilities but the way the tool is placed in the course—optional tools see spikes near deadlines, while integrated tools prompt verbatim copying of assignment text. Whole-conversation summaries hide these patterns, so the new metrics are offered as a practical way for builders of educational dialogue systems to check whether their designs are producing the intended learning interactions.

Core claim

Educators intend conversational tutors to produce sustained learning dialogue, but students mainly use them for answer-extraction. Deployment context predicts usage patterns more strongly than student preference or system design: optional tools concentrate activity around deadlines, while tools built into course structure lead students to request solutions to verbatim assignment questions. Turn-by-turn analysis reveals these behaviors that whole-dialogue metrics overlook.

What carries the argument

The six computational metrics that automatically score pedagogical alignment in each turn of student-AI dialogue.

If this is right

Integrating AI tools into assignments produces requests for direct solutions to the exact questions students must answer.
Making AI tools optional leads to usage spikes near deadlines rather than steady learning use.
Turn-by-turn metrics detect usage patterns that overall conversation summaries conceal.
Researchers can apply the metrics to test whether new educational dialogue systems meet their stated pedagogical targets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Changing assignment structure or grading expectations may be required to shift students toward deeper dialogue even if the AI itself improves.
The metrics could be applied to non-educational chat systems to check whether users treat them as problem-solving shortcuts rather than learning aids.
Longer-term studies could test whether the observed usage patterns correlate with differences in exam performance or retention.

Load-bearing premise

The six metrics measure true pedagogical alignment even though they have not been checked against actual student learning gains.

What would settle it

A controlled comparison showing equal or higher learning gains in classes where the AI is integrated and students request verbatim solutions would falsify the claim of misalignment.

Figures

Figures reproduced from arXiv: 2604.23486 by Angela Sun, Jonathan K. Kummerfeld, Matthew Clemson, Sebastian Kobler.

**Figure 1.** Usage concentration across constrained plat...

**Figure 2.** Conversational Engagement Score vs Learning Orientation Index for 500 conversations. Blue points: ...

**Figure 3.** Temporal heatmap showing message volume across academic semesters for all five datasets. Darker ...

**Figure 4.** Crisis mode behavioural changes. Each panel shows the percentage change from baseline to peak ...

**Figure 5.** Crisis mode behavioural changes. Each panel shows the percentage change from baseline to peak ...

**Figure 6.** Overall crisis mode scores across four optional-tool courses. All datasets show some shifts (0.19-0.24 ...

Original abstract

Educational NLP systems are typically evaluated using engagement metrics and satisfaction surveys, which are at best a proxy for meeting pedagogical goals. We introduce six computational metrics for automated evaluation of pedagogical alignment in student-AI dialogue. We validate our metrics through analysis of 12,650 messages across 500 conversations from four courses. Using our metrics, we identify a fundamental misalignment: educators design conversational tutors for sustained learning dialogue, but students mainly use them for answer-extraction. Deployment context is the strongest predictor of usage patterns, outweighing student preference or system design: when AI tools are optional, usage concentrates around deadlines; when integrated into course structure, students ask for solutions to verbatim assignment questions. Whole-dialogue evaluation misses these turn-by-turn patterns. Our metrics will enable researchers building educational dialogue systems to measure whether they are achieving their pedagogical goals.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives researchers six new turn-by-turn metrics to check if student-AI dialogues actually support learning goals instead of just counting engagement, and shows that course integration drives answer-seeking behavior more than design or preference.

The main takeaway is that this work supplies concrete computational metrics for spotting pedagogical misalignment in student-AI chats, backed by analysis of 12,650 real messages from 500 conversations across four courses. It finds students mostly extract answers rather than sustain learning dialogue, with deployment context as the strongest driver: optional tools see deadline spikes, while integrated ones pull verbatim assignment questions. Whole-dialogue checks miss these patterns, so the metrics target turn-level evaluation instead of broad proxies like satisfaction surveys. That addresses a clear gap in how educational NLP systems get assessed.

The data scale and focus on actual course logs are strengths; they ground the observations in real usage rather than lab setups. The metrics appear defined to capture alignment aspects like question type and response depth, which lets the authors quantify the mismatch directly.

Soft spots center on validation. The metrics rest on internal pattern detection in the dataset without reported ties to measured learning gains, expert ratings of dialogue quality, or pre/post assessments. This makes the interpretation of verbatim usage as misalignment reasonable but still proxy-based, and it weakens claims that context outweighs other factors until those links are checked. Minor gaps include fuller detail on inter-rater reliability for any human-anchored steps and controls for course-specific confounds.

The work suits researchers building or evaluating educational dialogue systems who need better tools than engagement counts. A reader in human-AI learning or educational NLP gets practical value from the metrics and the deployment insight. It deserves serious referee time because the data is substantive and the problem is real, even if revisions should tighten the external grounding of the metrics.

Referee Report

2 major / 1 minor

Summary. The paper introduces six computational metrics for automated evaluation of pedagogical alignment in student-AI dialogue. Analysis of 12,650 messages across 500 conversations from four courses reveals a misalignment: educators intend sustained learning dialogue but students primarily use the tools for answer extraction. Deployment context (optional vs. integrated into course structure) is identified as the strongest predictor of usage patterns, outweighing preferences or design, with whole-dialogue evaluation missing turn-by-turn patterns. The metrics are validated via this dataset analysis and positioned to help researchers measure alignment with pedagogical goals.

Significance. If the metrics prove reliable, the work offers a practical framework for assessing real-world educational LLM use beyond engagement proxies, supported by a sizable multi-course dataset of authentic conversations. This could inform better system design by highlighting context effects and turn-level behaviors, advancing educational NLP evaluation practices.

major comments (2)

[Abstract and validation section] The metrics are validated through internal pattern detection on the 12,650 messages, yet no external validation (e.g., correlation with learning outcomes, pre/post assessments, or expert-rated dialogue quality) is reported. This leaves the interpretation of 'pedagogical alignment' and the classification of verbatim-question usage as misalignment dependent on untested proxies, which is load-bearing for the central claim that context is the dominant predictor.
[Results on deployment context] The cross-course comparison attributes usage differences primarily to optional vs. integrated deployment, but lacks reported controls for confounders such as course subject matter, assignment types, or student demographics. Without these, the claim that context outweighs preferences or design cannot be securely established from the observational data.

minor comments (1)

[Abstract] The abstract states that 'whole-dialogue evaluation misses these turn-by-turn patterns,' but the manuscript could more explicitly describe how each of the six metrics operates at the turn level versus aggregating over dialogues.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We respond to each major point below, indicating where we will revise the manuscript to address the concerns while preserving the integrity of our observational analysis.

Referee: Abstract and validation section: The metrics are validated through internal pattern detection on the 12,650 messages, yet no external validation (e.g., correlation with learning outcomes, pre/post assessments, or expert-rated dialogue quality) is reported. This leaves the interpretation of 'pedagogical alignment' and the classification of verbatim-question usage as misalignment dependent on untested proxies, which is load-bearing for the central claim that context is the dominant predictor.

Authors: We acknowledge that the validation relies on internal consistency with expected pedagogical patterns rather than external measures such as learning outcomes or expert ratings. The metrics were designed to be computable from dialogue logs alone, and their utility is demonstrated by surfacing clear usage differences across the 500 conversations. We do not possess pre/post assessment data or expert annotations for this dataset. In the revised manuscript, we will add an explicit Limitations section that discusses the proxy-based nature of the misalignment interpretation and calls for future studies to perform external validation. This will qualify the claims appropriately without overstating the current evidence.

Revision: yes

Referee: Results on deployment context: The cross-course comparison attributes usage differences primarily to optional vs. integrated deployment, but lacks reported controls for confounders such as course subject matter, assignment types, or student demographics. Without these, the claim that context outweighs preferences or design cannot be securely established from the observational data.

Authors: The analysis draws on observational data from four courses that differ in both deployment context and other characteristics. With only four courses, statistical controls for all potential confounders are not feasible. We will revise the Results and Discussion sections to include a more detailed enumeration of possible confounding variables and to frame the findings as identifying context as the strongest observed correlate rather than proving it outweighs all other factors. We will also note that experimental designs would be required for stronger causal claims. These changes will clarify the scope of the conclusions while retaining the empirical patterns identified in the data.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical observational study with independent data analysis

The paper introduces six computational metrics for evaluating student-AI dialogue and applies them directly to an external dataset of 12,650 messages across 500 conversations from four courses. No equations, derivations, fitted parameters, or predictions are present. The central claim that deployment context is the strongest predictor emerges from pattern detection in the observed data rather than from any self-referential definition or self-citation chain. Validation consists of applying the metrics to the collected conversations; this is standard empirical analysis and does not reduce the results to the inputs by construction. The study is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that the introduced metrics capture pedagogical alignment and that the sampled conversations are representative of typical student-AI use.

axioms (1)

Domain assumption: The six computational metrics accurately reflect pedagogical alignment goals. Invoked to interpret usage patterns as misalignment; no external validation or learning-outcome correlation is described in the abstract.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1]

arXiv preprint arXiv:2503.07928 (2025)

The studychat dataset: Student dialogues with chatgpt in an artificial intelligence course. Preprint, arXiv:2503.07928. Janet Metcalfe. 2009. Metacognitive judgments and control of study. Current Directions in Psychological Science, 18(3):159–163. R Charles Murray and Kurt VanLehn. 2005. Effects of dissuading unnecessary help requests while providing proa...

[2]

Oliment: Conversations about open learner modelling to help learners understand and self-assess learning goals. In Artificial Intelligence in Education, pages 132–145, Cham. Springer Nature Switzerland. Andres Felipe Zambrano, Nidhi Nasiar, Jaclyn Ocumpaugh, Stephen Hutt, and Ryan S Baker. 2024. Says who? how different ground truth measures of emotion impa...

[3]

Remembering: test the student’s ability to recall or recognise information, facts, and concepts. It involves retrieving relevant knowledge from long-term memory. Exam questions will rarely ask for remembering. The only time students will be asked to recall facts is if it is something important for conceptual understanding, e.g., features of DNA structure

[4]

Understanding: ask students to demonstrate their grasp of the meaning of material, which could include interpreting, exemplifying, classifying, summarising, inferring, comparing, and explaining. Exam questions will usually be at the level of understanding or above

[5]

Applying: students are expected to use learned material in new and concrete situations, which may include applying rules, methods, concepts, principles, laws, and theories

[6]

Analysing: require students to break down informational materials into their component parts to understand their organisational structure. This might involve differentiating, organising, and attributing

[7]

Evaluating: students must make judgments based on criteria and standards. This can involve checking, critiquing, and making judgments about information, validity of ideas, or quality of work

[8]

{previous_msg['content'][:500]}

Creating: involves putting elements together to form a coherent or functional whole, reorganising elements into a new pattern, or constructing new meanings and ideas.
"""

A.2 Metric Prompts

The following sections contain the complete prompts used for LLM-based metric evaluation, along with implementation details for rule-based components.

A.2.1 Conversati...

[9]

FOLLOW-UP PATTERN: How often does the student build upon, reference, or continue discussion from AI responses? Consider:
- Questions that expand on AI explanations
- Requests for clarification or examples
- Building on previous answers with related questions
- Natural conversational flow vs isolated questions

[10]

CONTEXT REFERENCES: How often does the student reference earlier parts of the conversation? Consider:
- Explicit references to previous topics ("as you mentioned earlier")
- Implicit connections between questions
- Thematic continuity across the conversation
- Building conceptual understanding over multiple turns

[11]

ACKNOWLEDGMENTS: How often does the student acknowledge AI responses? Consider:
- Thanks, appreciation, or gratitude expressions
- Confirmations of understanding ("I see", "makes sense")
- Reactions to AI explanations
- Social engagement signals
Provide your analysis in JSON format with scores from 0.0 to 1.0: { "followup_rate": <0.0-1.0>, "context_rate":...

[12]

Response Type:
- accepting: Student engages with the scaffolding approach
- resisting: Student explicitly asks for direct answers or shows frustration
- bypassing: Student reformulates to avoid the pedagogical approach
- mixed: Shows both engagement and resistance

[13]

If resisting/bypassing, what strategy?
- direct_request: Explicitly asks for the answer
- ignore_guidance: Proceeds without addressing the scaffolding
- reformulation: Rephrases to circumvent pedagogy
- frustration_expression: Shows impatience/annoyance
- minimal_engagement: Gives token response then asks for answer

[14]

just tell me

Engagement Level: high, medium, or low
Format your response as:
response_type: [type]
resistance_strategy: [strategy or none]
engagement_level: [level]

Listing 10: Whole Dialogue SRS Analysis Prompt
Analyze this ENTIRE educational conversation to identify scaffolding events and student responses.
{conversation_text}
Identify each instance where the AI prov...
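
The response format in excerpts [12] to [14] lends itself to a small validating parser. The following is a minimal sketch, not the authors' implementation: the label sets are copied from the excerpts, while the function name `parse_scaffolding_response` and the decision to raise on unknown or missing labels are assumptions made for illustration.

```python
import re

# Label sets copied from excerpts [12]-[14]; everything else here is illustrative.
ALLOWED = {
    "response_type": {"accepting", "resisting", "bypassing", "mixed"},
    "resistance_strategy": {
        "direct_request", "ignore_guidance", "reformulation",
        "frustration_expression", "minimal_engagement", "none",
    },
    "engagement_level": {"high", "medium", "low"},
}

# Matches lines such as "response_type: resisting" or "response_type: [resisting]".
LINE_RE = re.compile(r"^\s*(\w+)\s*:\s*\[?\s*([\w ]+?)\s*\]?\s*$")


def parse_scaffolding_response(text: str) -> dict:
    """Parse 'key: value' lines and validate them against the allowed labels."""
    parsed = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # ignore any free text around the structured lines
        key, value = match.group(1).lower(), match.group(2).lower()
        if key in ALLOWED:
            if value not in ALLOWED[key]:
                raise ValueError(f"unexpected {key}: {value!r}")
            parsed[key] = value
    missing = set(ALLOWED) - set(parsed)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return parsed
```

Rejecting unknown labels outright, rather than silently mapping them to a default, is one way to keep malformed model output from leaking into turn-level counts.
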
[15]

COPY-PASTE INDICATORS: Does the student appear to be copying questions from an assignment?
- Formal problem language ("Question 1:", "Part a)", "Problem 2.3")
- Academic imperatives ("Calculate", "Determine", "Prove that")
- Multiple numbered or lettered questions in sequence

[16]

PROBLEM SET BEHAVIOUR: Is the student working through unrelated problems?
- Jumping between topics without transition
- Series of disconnected questions
- Checklist-like progression

[17]

ANSWER-SEEKING FOCUS: Is the student seeking answers vs understanding?
- No follow-up questions after receiving answers
- Lack of engagement with explanations
- Focus on final solutions only

[18]

URGENCY/DEADLINE SIGNALS: Are there signs of time pressure?
- Mentions of due dates
- References to class assignments
- Rapid question sequences
Response format: JSON with scores 0.0-1.0 for each indicator.

Method 2: Rule-Based Pattern Detection
Academic Imperatives Dictionary: imperatives = { 'calculate', 'determine', 'prove', 'show that', 'derive', 'fi...
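
The rule-based side of the copy-paste indicators in excerpts [15] and [18] can be sketched with two regular expressions. This is an illustration under stated assumptions: the imperative list contains only the five entries legible in the truncated dictionary, the label patterns are inferred from the three quoted examples, and the function name `copy_paste_signals` is hypothetical. How the paper combines such signals into a score is not visible in the excerpt, so the sketch only reports what it finds.

```python
import re

# Only the imperatives legible in excerpt [18]; the source dictionary is truncated.
IMPERATIVES = ("calculate", "determine", "prove", "show that", "derive")

# Formal problem labels modelled on the quoted examples:
# "Question 1:", "Part a)", "Problem 2.3". The exact patterns are an assumption.
LABEL_RE = re.compile(
    r"\b(?:question\s+\d+\s*:|part\s+[a-z]\)|problem\s+\d+(?:\.\d+)*)",
    re.IGNORECASE,
)
IMPERATIVE_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(word) for word in IMPERATIVES) + r")\b",
    re.IGNORECASE,
)


def copy_paste_signals(message: str) -> dict:
    """Report formal problem labels and academic imperatives found in one message."""
    labels = LABEL_RE.findall(message)
    imperatives = [hit.lower() for hit in IMPERATIVE_RE.findall(message)]
    return {
        "formal_labels": labels,
        "imperatives": imperatives,
        "any_signal": bool(labels) or bool(imperatives),
    }
```

A single imperative is a weak signal on its own, since a student may legitimately ask how to calculate something, which is presumably why the excerpt pairs the rules with an LLM-based rating.
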
[19]

Calculate weekly message volumes across semester

[20]

Identify baseline: weeks with usage < mean + 0.5 × std

[21]

Peak period: week(s) with maximum usage
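
Steps [19] to [21] can be sketched in a few lines of standard-library Python. Several details are assumptions rather than anything the excerpt states: weeks are ISO calendar weeks, weeks with no messages are simply absent from the counts, and "std" is taken to be the population standard deviation (`statistics.pstdev`). The function names are hypothetical.

```python
from collections import Counter
from datetime import date
from statistics import mean, pstdev


def weekly_volumes(message_dates):
    """Count messages per ISO (year, week); weeks without messages are omitted."""
    counts = Counter()
    for d in message_dates:
        iso = d.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


def baseline_and_peak(volumes):
    """Split weeks into baseline (< mean + 0.5 * std) and peak (maximum usage)."""
    values = list(volumes.values())
    # Assumption: population standard deviation; the excerpt does not say which.
    threshold = mean(values) + 0.5 * pstdev(values)
    baseline = sorted(week for week, v in volumes.items() if v < threshold)
    top = max(values)
    peak = sorted(week for week, v in volumes.items() if v == top)
    return baseline, peak, threshold
```

For example, `baseline_and_peak(weekly_volumes(dates))` on a list of `datetime.date` objects returns the baseline weeks, the peak week or weeks, and the threshold used to separate them.
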
[22]

?” in single message • Exclamations: Excessive “!

Minimum baseline: 2 weeks of activity required
Component Calculations:
Panic Indicators (PI) – 30% weight:
Detection patterns:
• Urgency language: {asap, urgent, immediately, right now, quickly}
• Repetition: Same question asked 2+ times within conversation
• Caps lock: Messages with >30% capitalised words
• Multiple questions: 3+ “?” in single message
•...
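
The legible panic-indicator patterns in excerpt [22] can be sketched as simple per-message flags plus a conversation-level repetition check. The thresholds (more than 30% capitalised words, three or more question marks, a question repeated two or more times) come from the excerpt. Everything else is an assumption for illustration: "capitalised" is read as fully upper-case words of two or more letters, repetition is detected by exact match after normalisation, and the exclamation pattern is left out because its threshold is cut off in the source. How the flags feed the 30% weight is also not visible, so no score is computed.

```python
import re
from collections import Counter

# Urgency terms as listed in excerpt [22].
URGENCY_TERMS = ("asap", "urgent", "immediately", "right now", "quickly")
URGENCY_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(term) for term in URGENCY_TERMS) + r")\b",
    re.IGNORECASE,
)


def caps_ratio(message: str) -> float:
    """Share of words written fully in upper case."""
    words = re.findall(r"[A-Za-z]+", message)
    if not words:
        return 0.0
    # Assumption: "capitalised" means fully upper-case words of 2+ letters.
    caps = [w for w in words if len(w) > 1 and w.isupper()]
    return len(caps) / len(words)


def message_flags(message: str) -> dict:
    """Per-message panic flags using the thresholds quoted in the excerpt."""
    return {
        "urgency": bool(URGENCY_RE.search(message)),
        "caps_lock": caps_ratio(message) > 0.30,
        "multi_question": message.count("?") >= 3,
    }


def repeated_questions(messages):
    """Return normalised messages that occur 2+ times in one conversation."""
    normalised = [" ".join(re.findall(r"[a-z0-9]+", m.lower())) for m in messages]
    counts = Counter(text for text in normalised if text)
    return [text for text, n in counts.items() if n >= 2]
```

Exact-match repetition will miss paraphrased repeats; a similarity measure could be substituted, but the excerpt does not say which notion of "same question" the paper uses.
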
