Warning About AI Fallibility Increases Help-Seeking in an Intelligent Tutoring System

Mirella Hladký; Tomohiro Nagashima; Vera Rief

arxiv: 2606.03822 · v1 · pith:FE4C4GMDnew · submitted 2026-06-02 · 💻 cs.HC

Pith reviewed 2026-06-28 08:20 UTC · model grok-4.3

classification 💻 cs.HC

keywords AI transparency · intelligent tutoring systems · help-seeking behavior · trust calibration · educational technology · AI fallibility · math learning · classroom experiment

The pith

Warning students that an AI tutor might err leads them to request more hints during problem-solving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether a simple message warning students about possible AI mistakes changes how they interact with a math intelligent tutoring system. In a classroom experiment, 252 school students used one of two otherwise identical versions of the system, which differed only in whether the warning was shown. Students who received the warning asked for significantly more hints than those who did not. A sympathetic reader would care because the result shows that transparency about AI limitations can shift learner strategies without changing immediate performance measures such as error rates or time on task.

Core claim

In a classroom experiment with 252 school students using a math intelligent tutoring system, those who received a warning message about potential system errors requested significantly more hints compared to those who did not receive the warning, despite identical system behavior. The study found no corresponding differences in error rates or time-on-task.

What carries the argument

The argument rests on the warning message about potential AI errors, a transparency intervention that affects help-seeking behavior.

If this is right

Transparency interventions about AI fallibility can alter learners' help-seeking behavior in intelligent tutoring systems.
Such interventions influence interaction strategies without necessarily changing immediate performance metrics like error rates or time-on-task.
Lightweight warnings can be used to adjust how students engage with pedagogical agents in educational technology.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

System designers could add similar warnings to encourage greater use of available hints when over-reliance on AI output is a concern.
The same approach might be tested in other AI-assisted learning tools to see whether it reduces uncritical acceptance of responses.
Longer-term studies could examine whether the increase in hint requests from such warnings leads to measurable gains in learning outcomes.
The effect size might vary with student age or prior experience with AI tools, suggesting targeted versions of the warning.

Load-bearing premise

The classroom experiment isolated the warning message as the sole cause of increased help-seeking, with no unmeasured confounds from teacher effects, student grouping, or log data interpretation affecting the comparison between conditions.

What would settle it

A replication study using the same warning but a new set of students or classes that finds no difference in hint requests between warned and unwarned groups would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.03822 by Mirella Hladký, Tomohiro Nagashima, Vera Rief.

**Figure 2.** Popup warning with a message.

2.3 Procedure. The study took place during two regular class periods (50 minutes each) of the “Information Technology and Data Science” course, which is held once a week for every class. Session 1 included an introduction of the study, pre-test on linear systems of equations (15 min), and students worked on the ITS for 15 minutes. Session 2 started with 15 minutes of continued p…

Original abstract

Recent work in Technology-Enhanced Learning and Human-Computer Interaction highlights the importance of transparency and trust calibration in AI-supported learning environments as they pose a risk of hallucinations. In this study, we investigate whether a simple transparency intervention that warns students that a pedagogical agent may make mistakes affects learner behavior in a math intelligent tutoring system. We conducted a classroom experiment with 252 school students using two system versions: one including a warning message about potential system errors, and one that does not mention potential errors. Using log data, we analyzed students' problem-solving performance data, including help-seeking behavior, error rate, and time-on-task. Results show that students who were warned about potential AI errors requested significantly more hints than those in the other condition, even though the actual system behavior was exactly the same. This finding suggests that lightweight transparency interventions can influence learners' interaction strategies without necessarily improving or impairing immediate performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A classroom experiment found that warning students about AI mistakes increased hint requests in a math ITS, though assignment and analysis details are thin.

The paper's main result is straightforward: in a study with 252 school students, those who saw a warning about possible AI errors asked for more hints than the no-warning group, even though the tutoring system itself behaved identically. Log data showed the difference in help-seeking.

The work does a clean job of testing a lightweight transparency intervention in a real classroom setting. It isolates the message as the variable and reports a significant shift in one specific behavior. That counts as a new empirical data point for people studying trust calibration in educational AI.

The soft spot is the lack of information on how conditions were assigned. Classroom experiments often put whole classes into one condition or the other, which can bring in teacher style, group composition, or prior exposure as confounds. The abstract mentions no randomization procedure or multilevel modeling to handle clustering, so the isolation of the warning effect is not yet secure. The other logged measures (error rate, time-on-task) are mentioned but not broken out, which leaves the practical meaning of the extra hints unclear.
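The clustering concern can be made concrete with the standard design-effect formula for equal-sized clusters, 1 + (m - 1) × ICC. The sketch below is only an illustration and applies to the case where whole classes, rather than individual students, were assigned to a condition. The class size and the intraclass correlation (ICC) values are hypothetical assumptions, since the abstract reports neither; only the total of 252 students comes from the paper.

```python
def design_effect(cluster_size: float, icc: float) -> float:
    """Design effect for equal-sized clusters: 1 + (m - 1) * ICC."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie in [0, 1]")
    return 1.0 + (cluster_size - 1.0) * icc


def effective_sample_size(n: int, cluster_size: float, icc: float) -> float:
    """Number of independent observations that n clustered ones are worth."""
    return n / design_effect(cluster_size, icc)


N_STUDENTS = 252         # reported in the paper
ASSUMED_CLASS_SIZE = 25  # hypothetical; not reported in the abstract

for icc in (0.0, 0.05, 0.10, 0.20):  # hypothetical ICC values
    deff = design_effect(ASSUMED_CLASS_SIZE, icc)
    n_eff = effective_sample_size(N_STUDENTS, ASSUMED_CLASS_SIZE, icc)
    print(f"ICC={icc:.2f}  design effect={deff:.2f}  effective n={n_eff:.1f}")
```

Under these assumed values, even a modest ICC shrinks the effective sample size considerably, which is why the level of assignment and any clustered analysis matter for the causal reading.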

This is useful for researchers working on AI in education and human-AI interaction in learning tools. A reader who needs concrete examples of how small interface changes affect student actions would get something from it.

It deserves peer review. The core claim is testable and the setup is simple enough that a referee can check the assignment and stats directly.

Referee Report

2 major / 0 minor

Summary. The paper reports results from a classroom experiment with 252 school students using a math intelligent tutoring system. It compares two conditions that differ only in the presence of a warning message about potential AI errors and finds, from log data, that the warned group requested significantly more hints while showing no differences in error rate or time-on-task. The authors conclude that a lightweight transparency intervention can increase help-seeking without altering immediate performance.

Significance. If the warning effect can be isolated from confounds, the result provides direct evidence that explicit statements about AI fallibility alter learner interaction strategies in real educational settings. The study’s use of authentic classroom log data and an otherwise identical system across conditions supplies a concrete, falsifiable demonstration relevant to trust calibration work in HCI and technology-enhanced learning.

major comments (2)

[Methods / Experiment description] The central claim—that the warning message alone caused the increase in help-seeking—requires that condition assignment isolated this variable. The manuscript does not report whether assignment occurred at the individual student level or at the class level, nor whether any multilevel or clustered analysis was performed to handle teacher or group effects. This detail is load-bearing for attributing the observed difference to the warning rather than unmeasured between-class variation.
[Results] Results section: the abstract states a significant difference in help-seeking but provides no information on the exact statistical test, degrees of freedom, effect size, or operational definition of the help-seeking measure (e.g., hints per problem, total hints, or proportion of problems on which help was requested). These omissions prevent verification that the reported difference supports the causal interpretation.
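To show why the operational definition matters, here is a minimal sketch that derives the three candidate measures named in the second comment from tutor logs. The log schema (student, problem, action) and the action labels are hypothetical, and the toy rows are invented for illustration; none of this reflects the paper's actual log format or data.

```python
from collections import defaultdict


def help_seeking_measures(log):
    """Compute per-student help-seeking measures.

    log: iterable of (student_id, problem_id, action) tuples, where
    action == "hint" marks a hint request (hypothetical schema).
    """
    hints_by_problem = defaultdict(dict)  # student -> {problem: hint count}
    for student, problem, action in log:
        counts = hints_by_problem[student]
        counts.setdefault(problem, 0)  # register problems with zero hints too
        if action == "hint":
            counts[problem] += 1

    measures = {}
    for student, counts in hints_by_problem.items():
        total = sum(counts.values())
        n_problems = len(counts)
        measures[student] = {
            "total_hints": total,
            "hints_per_problem": total / n_problems,
            "prop_problems_with_hint": sum(c > 0 for c in counts.values()) / n_problems,
        }
    return measures


# Invented toy log, not data from the study.
toy_log = [
    ("s1", "p1", "attempt"),
    ("s1", "p1", "hint"),
    ("s1", "p1", "hint"),
    ("s1", "p2", "attempt"),
    ("s2", "p1", "hint"),
    ("s2", "p2", "hint"),
]

for student, m in help_seeking_measures(toy_log).items():
    print(student, m)
```

In the toy log both students have the same total hints and the same hints per problem, yet they differ in the proportion of problems on which help was requested, so the choice of measure can change which group looks like it sought more help.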

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments, which highlight important gaps in methodological transparency. We will revise the manuscript to address both points directly.

Referee: [Methods / Experiment description] The central claim—that the warning message alone caused the increase in help-seeking—requires that condition assignment isolated this variable. The manuscript does not report whether assignment occurred at the individual student level or at the class level, nor whether any multilevel or clustered analysis was performed to handle teacher or group effects. This detail is load-bearing for attributing the observed difference to the warning rather than unmeasured between-class variation.

Authors: We agree that the assignment procedure must be reported for causal attribution. Students were randomly assigned to conditions at the individual level within each class. The revised Methods section will explicitly describe this procedure and will include multilevel modeling results to account for class-level clustering; these analyses yield the same pattern of results for help-seeking.

Revision: yes

Referee: [Results] Results section: the abstract states a significant difference in help-seeking but provides no information on the exact statistical test, degrees of freedom, effect size, or operational definition of the help-seeking measure (e.g., hints per problem, total hints, or proportion of problems on which help was requested). These omissions prevent verification that the reported difference supports the causal interpretation.

Authors: We acknowledge these reporting omissions. The revised Results section and abstract will specify the operational definition (mean hints requested per problem), the exact test performed, degrees of freedom, p-value, and effect size, allowing full verification of the reported difference.

Revision: yes
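As an illustration of the quantities the response promises to report, the sketch below computes a two-sample test statistic, its degrees of freedom, and a standardized effect size for mean hints per problem. Welch's t-test and Cohen's d are assumptions chosen for the example; the paper's actual test is not stated here. The two small samples are invented and are not results from the study.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of group a's mean
    vb = variance(b) / len(b)  # squared standard error of group b's mean
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


# Invented hints-per-problem values, for illustration only.
warned = [2.0, 3.0, 4.0]
control = [1.0, 2.0, 3.0]

t_stat, dof = welch_t(warned, control)
print(f"t = {t_stat:.3f}, df = {dof:.1f}, d = {cohens_d(warned, control):.2f}")
```

A p-value would additionally require the t distribution's cumulative distribution function, which the Python standard library does not provide, so it is left out of this sketch.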

Circularity Check

0 steps flagged

Empirical classroom experiment with no derivations or fitted predictions

The paper reports results from a between-conditions classroom experiment (252 students, log data on help-seeking, error rate, time-on-task). The central claim is an observed statistical difference in hint requests between the warning and no-warning versions. No equations, parameter fitting, self-citations used as uniqueness theorems, or ansatzes appear in the derivation chain. The result is not constructed from prior inputs by definition or renaming; it rests on direct comparison of system logs under identical system behavior. Self-citations to prior TEL/HCI work are background only and not load-bearing for the reported finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical observation from a classroom experiment and standard statistical comparison of log data; no free parameters, ad-hoc axioms, or invented entities are introduced.

pith-pipeline@v0.9.1-grok · 5691 in / 1061 out tokens · 25405 ms · 2026-06-28T08:20:32.594144+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 3 canonical work pages

[1] Effectiveness of intelligent tutoring systems: a meta-analytic review. Review of Educational Research, 2016.
[2] Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
[3] Beyond final answers: Evaluating large language models for math tutoring. International Conference on Artificial Intelligence in Education, 2025.
[4] Using LLMs to bring evidence-based feedback into the classroom: AI-generated feedback increases secondary students’ text revision, motivation, and positive emotions. Computers and Education: Artificial Intelligence, 2024.
[5] When LLMs hallucinate: Examining the effects of erroneous feedback in math tutoring systems. Proceedings of the Twelfth ACM Conference on Learning @ Scale.
[6] Finding and fixing errors in worked examples: Can this foster learning outcomes? Learning and Instruction, 2007.
[7] Using example problems to improve student learning in algebra: Differentiating between correct and incorrect examples. Learning and Instruction, 2013.
[8] Using Anticipatory Diagrammatic Self-Explanation to Support Learning and Performance in Early Algebra. Grantee Submission.
[9] The cognitive tutor authoring tools (CTAT): Preliminary evaluation of efficiency gains. International Conference on Intelligent Tutoring Systems, 2006.
[10] Kosch, Thomas; Welsch, Robin; Chuang, Lewis; Schmidt, Albrecht. ACM Trans. Comput.-Hum. Interact., January 2023. doi:10.1145/3529225
[11] Schulz, Kenneth F.; Grimes, David A. Unequal group sizes in randomised trials: guarding against guessing. 2002. doi:10.1016/S0140-6736(02)08029-7
[12] Research Note: Unequal randomisation in randomised trials. 2026. doi:10.1016/j.jphys.2025.11.009
