Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems
Pith reviewed 2026-05-07 15:05 UTC · model grok-4.3
The pith
AI explanations in language learning tools often look helpful but contain flaws that can reinforce errors and erode trust.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AI systems providing language feedback can fail across six dimensions: diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation. These failures create explainability pitfalls: AI-generated explanations that appear helpful on the surface but are fundamentally flawed. In the language-learning setting such pitfalls raise the likelihood of attainment harms, human-AI interaction harms, and socioaffective harms, because learners may not detect the problems and teachers may not either. The paper maps concrete failure modes on each dimension and argues that the sustained, personal character of language study amplifies these risks.
What carries the argument
Explainability pitfalls, defined as AI-generated explanations that appear helpful on the surface but are fundamentally flawed when evaluated against the six dimensions of effective language feedback.
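One way to make that definition operational: a pitfall is an explanation whose surface plausibility is high while at least one of its six dimension scores is low. A minimal sketch in Python under that reading; the field names and the 0.7 and 0.5 thresholds are illustrative assumptions, not values from the paper:

```python
from dataclasses import dataclass

# The six dimensions of effective language feedback named in the paper.
DIMENSIONS = (
    "diagnostic_accuracy",
    "awareness_of_appropriacy",
    "causes_of_error",
    "prioritisation",
    "guidance_for_improvement",
    "supporting_self_regulation",
)

@dataclass
class ExplanationAssessment:
    """One AI-generated explanation scored against the six dimensions."""
    explanation: str
    surface_plausibility: float  # how helpful it *looks* to a learner, in [0, 1]
    dimension_scores: dict       # dimension name -> quality score in [0, 1]

    def is_pitfall(self, plausible_above: float = 0.7,
                   flawed_below: float = 0.5) -> bool:
        # Pitfall signature: looks helpful, yet fails some dimension.
        return (self.surface_plausibility >= plausible_above
                and any(self.dimension_scores.get(d, 0.0) < flawed_below
                        for d in DIMENSIONS))
```

On this encoding, the pitfall is precisely the gap between how helpful an explanation looks and how it scores on the dimensions, which is what makes it hard for learners and teachers to detect.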
If this is right
- Learners can internalize incorrect rules or patterns without realizing the AI feedback is wrong.
- Teachers may overlook the flaws when reviewing AI-generated responses.
- Extended use of the tools can gradually worsen overall language proficiency.
- The personal and repeated nature of language practice amplifies risks of reduced learner confidence and motivation.
- Evaluation frameworks for AI explanations must incorporate domain-specific checks for these failure modes.
Where Pith is reading between the lines
- The same pattern of surface-plausible but flawed explanations likely appears in AI tools for other school subjects.
- Developers could add automated checks against the six dimensions to reduce the incidence of these pitfalls (see the sketch after this list).
- Controlled experiments with actual language learners would provide direct evidence on whether the pitfalls translate into measurable learning losses.
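On that second point, one plausible wiring is a pre-release gate that scores each candidate explanation on all six dimensions and withholds any that fail. A minimal sketch, assuming a hypothetical per-dimension judge `score_dimension` that the paper does not specify (it could be a rubric-prompted grader model or a battery of hand-written linguistic test cases):

```python
DIMENSIONS = [
    "diagnostic_accuracy", "awareness_of_appropriacy", "causes_of_error",
    "prioritisation", "guidance_for_improvement", "supporting_self_regulation",
]

def score_dimension(explanation: str, dimension: str) -> float:
    """Hypothetical judge returning a quality score in [0, 1] for one
    dimension. Stand-in for a real grader, which the paper does not supply."""
    raise NotImplementedError

def gate_explanation(explanation: str, threshold: float = 0.5) -> dict:
    """Score an explanation on all six dimensions; hold it back
    (e.g. route to human review) if any dimension falls below threshold."""
    scores = {d: score_dimension(explanation, d) for d in DIMENSIONS}
    failing = sorted(d for d, s in scores.items() if s < threshold)
    return {"show_to_learner": not failing, "failing": failing, "scores": scores}
```

The design choice worth noting: the gate keys on dimension scores rather than on surface plausibility, since by definition a pitfall passes any plausibility check.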
Load-bearing premise
That the six dimensions fully capture the critical failure modes of AI feedback, and that these flawed explanations actually produce the claimed harms during real learner interactions.
What would settle it
A longitudinal study of language learners that tracks error persistence and motivation over months and finds no measurable difference between users of standard AI feedback and users of feedback known to fail on the six dimensions.
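Concretely, the comparison such a study implies is between error-persistence rates under ordinary AI feedback and under feedback known to fail the six dimensions. A minimal analysis sketch with invented numbers; the design, the persistence metric, and the values are all assumptions for illustration:

```python
from math import sqrt
from statistics import mean, stdev

def welch_t(a: list, b: list) -> float:
    """Welch's t statistic for two independent samples."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))

# Illustrative numbers only: share of each learner's month-1 errors
# still being made at month 6, by feedback condition.
standard_feedback = [0.22, 0.18, 0.25, 0.20, 0.19]
flawed_feedback   = [0.31, 0.35, 0.28, 0.33, 0.30]

# t well above zero supports the harms claim; a well-powered t near
# zero is the null result described above.
print(welch_t(flawed_feedback, standard_feedback))
```

The pitfalls thesis predicts a clear gap between the groups, so it is the well-powered null, not the positive result, that would count against the paper.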
Original abstract
AI-powered language learning tools increasingly provide instant, personalised feedback to millions of learners worldwide. However, this feedback can fail in ways that are difficult for learners--and even teachers--to detect, potentially reinforcing misconceptions and eroding learning outcomes over extended use. We present a portion of L2-Bench, a benchmark for evaluating AI systems in language education that includes (but is not limited to) six critical dimensions of effective feedback: diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation. We analyse how AI systems can fail with respect to these dimensions. These failures, which we argue are conducive to "explainability pitfalls," are AI-generated explanations that appear helpful on the surface but are fundamentally flawed, increasing the risk of attainment, human-AI interaction, and socioaffective harms. We discuss how the specific context of language learning amplifies these risks and outline open questions we believe merit more attention when designing evaluation frameworks specifically. Our analysis aims to expand the community's understanding of both the typology of explainability pitfalls and the contextual dynamics in which they may occur in order to encourage AI developers to better design safe, trustworthy, and effective AI explanations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a portion of L2-Bench, a benchmark for evaluating AI systems in language education, organized around six dimensions of effective feedback (diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation). It analyzes how AI-generated explanations can fail on these dimensions, framing such failures as 'explainability pitfalls'—superficially helpful but fundamentally flawed outputs—and argues that these increase risks of attainment, human-AI interaction, and socioaffective harms. The paper discusses how language-learning contexts amplify these risks and outlines open questions for designing evaluation frameworks.
Significance. If the typology of pitfalls is later validated with empirical data and the claimed causal pathways to learner harms are demonstrated, the work could help guide safer design of personalized feedback tools used by millions of language learners, expanding the community's understanding of undetectable explanation failures in educational AI.
major comments (3)
- [Abstract] The central claim that failures on the six dimensions produce explainability pitfalls that increase attainment, human-AI interaction, and socioaffective harms is asserted without any concrete examples, benchmark data, learner studies, or causal mechanisms, leaving the argument as a conceptual typology rather than an evidence-based analysis.
- [L2-Bench description] Although the manuscript states that it presents a portion of L2-Bench, no specific benchmark items, evaluation protocols, AI output examples, or failure instances on the listed dimensions are supplied, which is required to make the analysis of AI failures operational and testable.
- [Discussion of harms] The three harm categories lack operational definitions, proxies, or any linkage to measurable outcomes; the manuscript provides no evidence that surface-plausible but incorrect feedback on the six dimensions actually produces the claimed negative effects in real learner interactions.
minor comments (1)
- [Abstract] The list of harms ('attainment, human-AI interaction, and socioaffective harms') would benefit from explicit labeling as three distinct categories to avoid ambiguity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. Our manuscript is a conceptual contribution that proposes a typology of explainability pitfalls and outlines dimensions for L2-Bench, rather than an empirical validation study. We address each major comment below and will revise the paper accordingly to improve clarity and concreteness.
Point-by-point responses
Referee: [Abstract] The central claim that failures on the six dimensions produce explainability pitfalls that increase attainment, human-AI interaction, and socioaffective harms is asserted without any concrete examples, benchmark data, learner studies, or causal mechanisms, leaving the argument as a conceptual typology rather than an evidence-based analysis.
Authors: We agree that the abstract presents the claims at a high level. The manuscript develops a typology through logical analysis of how failures on the six dimensions can produce superficially plausible but flawed explanations, with risks argued via pathways drawn from second-language acquisition and AI ethics literature. No new empirical data or causal studies are included because the paper's aim is to identify the typology and open questions to guide future work. We will revise the abstract to explicitly note its conceptual scope and add brief illustrative examples of AI explanation failures in the main text.
Revision: partial
Referee: [L2-Bench description] Although the manuscript states that it presents a portion of L2-Bench, no specific benchmark items, evaluation protocols, AI output examples, or failure instances on the listed dimensions are supplied, which is required to make the analysis of AI failures operational and testable.
Authors: The manuscript introduces the six dimensions and discusses potential failure modes at the framework level. Specific benchmark items, protocols, and instantiated examples are part of the full L2-Bench development, planned for separate release. This paper focuses on the conceptual structure and pitfalls. We will add high-level evaluation protocol descriptions and concrete examples of AI outputs and failures for each dimension in the revised version to make the analysis more operational (see the illustrative item sketch after these responses).
Revision: yes
Referee: [Discussion of harms] The three harm categories lack operational definitions, proxies, or any linkage to measurable outcomes; the manuscript provides no evidence that surface-plausible but incorrect feedback on the six dimensions actually produces the claimed negative effects in real learner interactions.
Authors: We acknowledge that the harms section is high-level. The three categories are hypothesized risks drawn from existing literature on educational AI and language learning, without new empirical demonstration of causality in this conceptual paper. In revision, we will add operational definitions, cite relevant proxies and studies for linkage to measurable outcomes, and clarify that the pathways are proposed to motivate future empirical work rather than asserted as proven.
Revision: partial
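To ground the second exchange above, here is one hedged guess at the shape an instantiated benchmark item could take; the schema, field names, and example judgments are invented for illustration and are not taken from L2-Bench:

```python
# Hypothetical shape of a single L2-Bench-style item: a learner error,
# an AI explanation to be judged, and per-dimension reference judgments.
item = {
    "learner_level": "B1",
    "learner_utterance": "Yesterday I go to the market.",
    "target_form": "Yesterday I went to the market.",
    "ai_explanation": (
        "'Go' should be 'went' because 'yesterday' signals past time."
    ),
    "reference_judgments": {
        "diagnostic_accuracy": 1,        # the error is correctly identified
        "awareness_of_appropriacy": 1,   # register and context respected
        "causes_of_error": 0,            # likely source of error not addressed
        "prioritisation": 1,             # the most important error comes first
        "guidance_for_improvement": 0,   # no practice strategy offered
        "supporting_self_regulation": 0, # learner not prompted to self-check
    },
}
```

An item of this shape would make the referee's request concrete: a system's explanation can be scored dimension by dimension against the reference judgments, and a high-plausibility explanation with zeros on several dimensions is exactly the pitfall the paper describes.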
Circularity Check
No circularity: purely descriptive typology with no derivations or self-referential reductions
Full rationale
The paper presents a conceptual framework and typology of explainability pitfalls in AI language learning feedback, organized around six dimensions of effective feedback. It contains no equations, fitted parameters, predictions derived from inputs, or mathematical derivations. The central claims rest on argumentative analysis of potential failure modes rather than any chain that reduces a result to its own definitions or prior self-citations. No load-bearing steps invoke self-citation for uniqueness theorems, smuggle ansatzes, or rename known results as novel derivations. The analysis is self-contained as a descriptive benchmark proposal and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The six dimensions (diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation) are critical for effective feedback.
invented entities (1)
- explainability pitfalls (no independent evidence)
Reference graph
Works this paper leans on
- [1] Gavin Abercrombie, Alice Curry, Tanvi Dinkar, Verena Rieser, and Zeerak Talat. 2023. Mirages: On Anthropomorphism in Dialogue Systems. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 4776–4790.
- [2]
- [3] Lorin W. Anderson, David R. Krathwohl, Peter W. Airasian, Kathleen A. Cruikshank, Richard E. Mayer, Paul R. Pintrich, James Raths, and Merlin C. Wittrock. 2001. A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives. Longman, New York, NY.
- [4] Albert Bandura. 1977. Social Learning Theory. Prentice Hall, Englewood Cliffs, NJ.
- [5] Hamsa Bastani, Osbert Bastani, Alp Sungu, Haoyang Ge, Ozge Kabakci, and Rani Mariman. 2024. Generative AI Can Harm Learning. SSRN Working Paper. doi:10.2139/ssrn.4895486.
- [6] Susan M. Brookhart. 2008. How to Give Effective Feedback to Your Students. ASCD, Alexandria, VA.
- [7] James Edgell, Wm. Matthew Kennedy, Isaac Pattis, Ben Knight, Danielle Carvalho, and Elizabeth Wonnacott. 2026. Beyond Accuracy: Towards a Robust Evaluation Methodology for AI Systems for Language Education. arXiv:2603.20088 [cs.CY]. https://arxiv.org/abs/2603.20088
- [8] Upol Ehsan, Samir Passi, Q. Vera Liao, Larry Chan, I-Hsiang Lee, Michael Muller, and Mark O. Riedl. 2024. The Who in XAI: How AI Background Shapes Perceptions of AI Explanations. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. ACM, Article 316. doi:10.1145/3613904.3642474.
- [9]
- [10]
- [11] Iason Gabriel, Andrea Manzini, Geoffrey Keeling, Lisa Anne Hendricks, Verena Rieser, Haroon Iqbal, Nenad Tomašev, Irina Ktena, Zachary Kenton, Manuel Rodriguez, Sam El-Sayed, Sarah Brown, Cansu Akbulut, Andrew Trask, Edward Hughes, Adam S. Bergman, Renee Shelby, Naomi Marchal, Casey Griffin, Juan Mateos-Garcia, Laura Weidinger, William Street, Benjamin La...
- [12] Michael Gerlich. 2025. AI Tools in Society: Impacts on Cognitive Offloading and the Future of Critical Thinking. Societies 15, 1 (2025), 6. doi:10.3390/soc15010006.
- [13] Arthur C. Graesser, Natalie K. Person, and Joseph P. Magliano. 1995. Collaborative Dialogue Patterns in Naturalistic One-to-One Tutoring. Applied Cognitive Psychology 9, 6 (1995), 495–522.
- [14] John Hattie and Helen Timperley. 2007. The Power of Feedback. Review of Educational Research 77, 1 (2007), 81–112.
- [15] Wayne Holmes. 2024. AIED—Coming of Age? International Journal of Artificial Intelligence in Education 34, 1 (2024), 1–11. doi:10.1007/s40593-023-00352-3.
- [16] Wayne Holmes and Fengchun Miao. 2023. Guidance for Generative AI for Education and Research. UNESCO.
- [17] Wayne Holmes and Ilkka Tuomi. 2022. State of the Art and Practice in AI in Education. European Journal of Education 57, 4 (2022), 542–570.
- [18] Fiona Hyland. 1998. The Impact of Teacher Written Feedback on Individual Writers. Journal of Second Language Writing 7, 3 (1998), 255–286.
- [19] Ivan Jurenka, Matthias Kunesch, Kyle R. McKee, Daniel Gillick, et al. 2024. Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach. arXiv:2407.12687.
- [20] Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. 2025. Why Language Models Hallucinate. arXiv:2509.04664.
- [21] Wm. Matthew Kennedy and Daniel Vargas Campos. 2024. Vernacularizing Taxonomies of Harm Is Essential for Operationalizing Holistic AI Safety. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 698–710.
- [22] Wm. Matthew Kennedy and Daniel Vargas Campos. 2026. A Vernacularized Taxonomy of Harms for AI in Education. In Handbook of Critical Studies in AI for Education, Wayne Holmes and Caroline Pelletier (Eds.). Edward Elgar. Forthcoming.
- [23] Val Klenowski. 2009. Assessment for Learning Revisited: An Asia-Pacific Perspective. Assessment in Education: Principles, Policy & Practice 16, 3 (2009), 263–268.
- [24] Akash Kundu, Adrianna Tan, Theodora Skeadas, Rumman Chowdhury, and Sarah Amos. 2025. Red Teaming for Trust: Evaluating Multicultural and Multilingual AI Systems in Asia-Pacific. In Building Trust Workshop at the International Conference on Learning Representations.
- [25]
- [26]
- [27] LearnLM Team, Google, and Eedi. 2025. AI Tutoring Can Safely and Effectively Support Students: An Exploratory RCT in UK Classrooms. Technical report.
- [28] Danielle S. McNamara, Laura K. Allen, Matthew E. Jacovina, and Aaron D. Likens. 2023. Leveraging Large Language Models for Language Learning. Journal of Learning Analytics 10, 3 (2023), 1–15.
- [29] Allen Nie, Yash Chandak, Miroslav Suzara, Ali Malik, Juliette Woodrow, Matt Peng, Mehran Sahami, Emma Brunskill, and Chris Piech. 2025. The GPT Surprise: Offering Large Language Model Chat in a Massive Coding Class Reduced Engagement but May Increase Adopters’ Exam Performance. In Proceedings of the Twelfth ACM Conference on Learning @ Scale (L@S ’25). Ass...
- [30] Seymour Papert. 1980. Mindstorms: Children, Computers, and Powerful Ideas. Basic Books, New York, NY.
- [31] Alison Pease, Anna Zamansky, and Sarah Wiseman. 2023. Pedagogical Implications of Large Language Models: Challenges and Opportunities. AI & Society (2023), 1–14. doi:10.1007/s00146-023-01753-4.
- [32] Chris Piech, Mehran Sahami, Daphne Koller, Steve Cooper, and Paulo Blikstein. 2015. Modeling How Students Learn to Program. In Proceedings of the 46th ACM Technical Symposium on Computer Science Education. ACM, 153–158. doi:10.1145/2676723.2677308.
- [33] Emanuel A. Schegloff and Harvey Sacks. 1973. Opening Up Closings. Semiotica 8, 4 (1973), 289–327.
- [34] Neil Selwyn. 2019. Should Robots Replace Teachers? AI and the Future of Education. Polity Press, Cambridge, UK.
- [35] Valerie J. Shute. 2008. Focus on Formative Feedback. Review of Educational Research 78, 1 (2008), 153–189.
- [36] Gordon Stobart, Elaine Boyd, Anthony Green, and Therese N. Hopfenbeck. 2019. Effective Feedback: The Key to Successful Assessment for Learning. Oxford University Press.
- [37] Stefano Teso, Oznur Alkan, Wolfgang Stammer, and Elizabeth Daly. 2023. Leveraging Explanations in Interactive Machine Learning: An Overview. Frontiers in Artificial Intelligence 6 (2023). doi:10.3389/frai.2023.1066049.
- [38] Kelsey Urgo, Jaime Arguello, and Robert Capra. 2019. Anderson and Krathwohl’s Two-Dimensional Taxonomy Applied to Task Creation and Learning Assessment. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval. ACM, 117–124. doi:10.1145/3341981.3344226.
- [39] Kurt VanLehn. 2011. The Relative Effectiveness of Human Tutoring, Intelligent Tutoring Systems, and Other Tutoring Systems. Educational Psychologist 46, 4 (2011), 197–221. doi:10.1080/00461520.2011.611369.
- [40] Lev S. Vygotsky. 1978. Mind in Society: The Development of Higher Psychological Processes. Harvard University Press, Cambridge, MA.
- [41]
- [42] Dylan Wiliam. 2011. Embedded Formative Assessment. Solution Tree Press, Bloomington, IN.
- [43] Simon Woodhead, Simon Blatchford, and Michael Webb. 2023. Can AI Tutors Improve Learning Outcomes at Scale? Results from a Randomized Controlled Trial. In Proceedings of the International Conference on Learning Analytics and Knowledge. ACM, 489–495.
- [44] Beverly Park Woolf. 2010. Building Intelligent Interactive Tutors: Student-Centered Strategies for Revolutionizing E-Learning. Morgan Kaufmann, Burlington, MA.