Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems
Pith reviewed 2026-05-07 15:05 UTC · model grok-4.3
The pith
AI explanations in language learning tools often look helpful but contain flaws that can reinforce errors and erode trust.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AI systems providing language feedback can fail across six dimensions: diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation. These failures create explainability pitfalls: AI-generated explanations that appear helpful on the surface but are fundamentally flawed. In the language-learning setting such pitfalls raise the likelihood of attainment harms, human-AI interaction harms, and socioaffective harms, because learners may not detect the problems and teachers may not either. The paper maps concrete failure modes on each dimension and argues that the sustained, personal character of language study amplifies these risks.
What carries the argument
Explainability pitfalls, defined as AI-generated explanations that appear helpful on the surface but are fundamentally flawed when evaluated against the six dimensions of effective language feedback.
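One way to make that definition operational: a pitfall is an explanation whose surface plausibility is high while at least one of its six dimension scores is low. A minimal sketch in Python under that reading; the field names and the 0.7 and 0.5 thresholds are illustrative assumptions, not values from the paper:

```python
from dataclasses import dataclass

# The six dimensions of effective language feedback named in the paper.
DIMENSIONS = (
    "diagnostic_accuracy",
    "awareness_of_appropriacy",
    "causes_of_error",
    "prioritisation",
    "guidance_for_improvement",
    "supporting_self_regulation",
)

@dataclass
class ExplanationAssessment:
    """One AI-generated explanation scored against the six dimensions."""
    explanation: str
    surface_plausibility: float  # how helpful it *looks* to a learner, in [0, 1]
    dimension_scores: dict       # dimension name -> quality score in [0, 1]

    def is_pitfall(self, plausible_above: float = 0.7,
                   flawed_below: float = 0.5) -> bool:
        # Pitfall signature: looks helpful, yet fails some dimension.
        return (self.surface_plausibility >= plausible_above
                and any(self.dimension_scores.get(d, 0.0) < flawed_below
                        for d in DIMENSIONS))
```

On this encoding, the pitfall is precisely the gap between how helpful an explanation looks and how it scores on the dimensions, which is what makes it hard for learners and teachers to detect.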
If this is right
- Learners can internalize incorrect rules or patterns without realizing the AI feedback is wrong.
- Teachers may overlook the flaws when reviewing AI-generated responses.
- Extended use of the tools can gradually worsen overall language proficiency.
- The personal and repeated nature of language practice amplifies risks of reduced learner confidence and motivation.
- Evaluation frameworks for AI explanations must incorporate domain-specific checks for these failure modes.
Where Pith is reading between the lines
- The same pattern of surface-plausible but flawed explanations likely appears in AI tools for other school subjects.
- Developers could add automated checks against the six dimensions to reduce the incidence of these pitfalls (see the sketch after this list).
- Controlled experiments with actual language learners would provide direct evidence on whether the pitfalls translate into measurable learning losses.
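On that second point, one plausible wiring is a pre-release gate that scores each candidate explanation on all six dimensions and withholds any that fail. A minimal sketch, assuming a hypothetical per-dimension judge `score_dimension` that the paper does not specify (it could be a rubric-prompted grader model or a battery of hand-written linguistic test cases):

```python
DIMENSIONS = [
    "diagnostic_accuracy", "awareness_of_appropriacy", "causes_of_error",
    "prioritisation", "guidance_for_improvement", "supporting_self_regulation",
]

def score_dimension(explanation: str, dimension: str) -> float:
    """Hypothetical judge returning a quality score in [0, 1] for one
    dimension. Stand-in for a real grader, which the paper does not supply."""
    raise NotImplementedError

def gate_explanation(explanation: str, threshold: float = 0.5) -> dict:
    """Score an explanation on all six dimensions; hold it back
    (e.g. route to human review) if any dimension falls below threshold."""
    scores = {d: score_dimension(explanation, d) for d in DIMENSIONS}
    failing = sorted(d for d, s in scores.items() if s < threshold)
    return {"show_to_learner": not failing, "failing": failing, "scores": scores}
```

The design choice worth noting: the gate keys on dimension scores rather than on surface plausibility, since by definition a pitfall passes any plausibility check.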
Load-bearing premise
That the six dimensions fully capture the critical failure modes of AI feedback, and that these flawed explanations actually produce the claimed harms during real learner interactions.
What would settle it
A longitudinal study of language learners that tracks error persistence and motivation over months and finds no measurable difference between users of standard AI feedback and users of feedback known to fail on the six dimensions.
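Concretely, the comparison such a study implies is between error-persistence rates under ordinary AI feedback and under feedback known to fail the six dimensions. A minimal analysis sketch with invented numbers; the design, the persistence metric, and the values are all assumptions for illustration:

```python
from math import sqrt
from statistics import mean, stdev

def welch_t(a: list, b: list) -> float:
    """Welch's t statistic for two independent samples."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / sqrt(va / len(a) + vb / len(b))

# Illustrative numbers only: share of each learner's month-1 errors
# still being made at month 6, by feedback condition.
standard_feedback = [0.22, 0.18, 0.25, 0.20, 0.19]
flawed_feedback   = [0.31, 0.35, 0.28, 0.33, 0.30]

# t well above zero supports the harms claim; a well-powered t near
# zero is the null result described above.
print(welch_t(flawed_feedback, standard_feedback))
```

The pitfalls thesis predicts a clear gap between the groups, so it is the well-powered null, not the positive result, that would count against the paper.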
Original abstract
AI-powered language learning tools increasingly provide instant, personalised feedback to millions of learners worldwide. However, this feedback can fail in ways that are difficult for learners--and even teachers--to detect, potentially reinforcing misconceptions and eroding learning outcomes over extended use. We present a portion of L2-Bench, a benchmark for evaluating AI systems in language education that includes (but is not limited to) six critical dimensions of effective feedback: diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation. We analyse how AI systems can fail with respect to these dimensions. These failures, which we argue are conducive to "explainability pitfalls," are AI-generated explanations that appear helpful on the surface but are fundamentally flawed, increasing the risk of attainment, human-AI interaction, and socioaffective harms. We discuss how the specific context of language learning amplifies these risks and outline open questions we believe merit more attention when designing evaluation frameworks specifically. Our analysis aims to expand the community's understanding of both the typology of explainability pitfalls and the contextual dynamics in which they may occur in order to encourage AI developers to better design safe, trustworthy, and effective AI explanations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a portion of L2-Bench, a benchmark for evaluating AI systems in language education, organized around six dimensions of effective feedback (diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation). It analyzes how AI-generated explanations can fail on these dimensions, framing such failures as 'explainability pitfalls'—superficially helpful but fundamentally flawed outputs—and argues that these increase risks of attainment, human-AI interaction, and socioaffective harms. The paper discusses how language-learning contexts amplify these risks and outlines open questions for designing evaluation frameworks.
Significance. If the typology of pitfalls is later validated with empirical data and the claimed causal pathways to learner harms are demonstrated, the work could help guide safer design of personalized feedback tools used by millions of language learners, expanding the community's understanding of undetectable explanation failures in educational AI.
major comments (3)
- [Abstract] The central claim that failures on the six dimensions produce explainability pitfalls that increase attainment, human-AI interaction, and socioaffective harms is asserted without any concrete examples, benchmark data, learner studies, or causal mechanisms, leaving the argument as a conceptual typology rather than an evidence-based analysis.
- [L2-Bench description] Although the manuscript states that it presents a portion of L2-Bench, no specific benchmark items, evaluation protocols, AI output examples, or failure instances on the listed dimensions are supplied, which is required to make the analysis of AI failures operational and testable.
- [Discussion of harms] The three harm categories lack operational definitions, proxies, or any linkage to measurable outcomes; the manuscript provides no evidence that surface-plausible but incorrect feedback on the six dimensions actually produces the claimed negative effects in real learner interactions.
minor comments (1)
- [Abstract] The list of harms ('attainment, human-AI interaction, and socioaffective harms') would benefit from explicit labeling as three distinct categories to avoid ambiguity.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. Our manuscript is a conceptual contribution that proposes a typology of explainability pitfalls and outlines dimensions for L2-Bench, rather than an empirical validation study. We address each major comment below and will revise the paper accordingly to improve clarity and concreteness.
Point-by-point responses
Referee: [Abstract] The central claim that failures on the six dimensions produce explainability pitfalls that increase attainment, human-AI interaction, and socioaffective harms is asserted without any concrete examples, benchmark data, learner studies, or causal mechanisms, leaving the argument as a conceptual typology rather than an evidence-based analysis.
Authors: We agree that the abstract presents the claims at a high level. The manuscript develops a typology through logical analysis of how failures on the six dimensions can produce superficially plausible but flawed explanations, with risks argued via pathways drawn from second-language acquisition and AI ethics literature. No new empirical data or causal studies are included because the paper's aim is to identify the typology and open questions to guide future work. We will revise the abstract to explicitly note its conceptual scope and add brief illustrative examples of AI explanation failures in the main text.
Revision: partial
Referee: [L2-Bench description] Although the manuscript states that it presents a portion of L2-Bench, no specific benchmark items, evaluation protocols, AI output examples, or failure instances on the listed dimensions are supplied, which is required to make the analysis of AI failures operational and testable.
Authors: The manuscript introduces the six dimensions and discusses potential failure modes at the framework level. Specific benchmark items, protocols, and instantiated examples are part of the full L2-Bench development, planned for separate release. This paper focuses on the conceptual structure and pitfalls. We will add high-level evaluation protocol descriptions and concrete examples of AI outputs and failures for each dimension in the revised version to make the analysis more operational (see the illustrative item sketch after these responses).
Revision: yes
Referee: [Discussion of harms] The three harm categories lack operational definitions, proxies, or any linkage to measurable outcomes; the manuscript provides no evidence that surface-plausible but incorrect feedback on the six dimensions actually produces the claimed negative effects in real learner interactions.
Authors: We acknowledge that the harms section is high-level. The three categories are hypothesized risks drawn from existing literature on educational AI and language learning, without new empirical demonstration of causality in this conceptual paper. In revision, we will add operational definitions, cite relevant proxies and studies for linkage to measurable outcomes, and clarify that the pathways are proposed to motivate future empirical work rather than asserted as proven.
Revision: partial
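To ground the second exchange above, here is one hedged guess at the shape an instantiated benchmark item could take; the schema, field names, and example judgments are invented for illustration and are not taken from L2-Bench:

```python
# Hypothetical shape of a single L2-Bench-style item: a learner error,
# an AI explanation to be judged, and per-dimension reference judgments.
item = {
    "learner_level": "B1",
    "learner_utterance": "Yesterday I go to the market.",
    "target_form": "Yesterday I went to the market.",
    "ai_explanation": (
        "'Go' should be 'went' because 'yesterday' signals past time."
    ),
    "reference_judgments": {
        "diagnostic_accuracy": 1,        # the error is correctly identified
        "awareness_of_appropriacy": 1,   # register and context respected
        "causes_of_error": 0,            # likely source of error not addressed
        "prioritisation": 1,             # the most important error comes first
        "guidance_for_improvement": 0,   # no practice strategy offered
        "supporting_self_regulation": 0, # learner not prompted to self-check
    },
}
```

An item of this shape would make the referee's request concrete: a system's explanation can be scored dimension by dimension against the reference judgments, and a high-plausibility explanation with zeros on several dimensions is exactly the pitfall the paper describes.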
Circularity Check
No circularity: purely descriptive typology with no derivations or self-referential reductions
Full rationale
The paper presents a conceptual framework and typology of explainability pitfalls in AI language learning feedback, organized around six dimensions of effective feedback. It contains no equations, fitted parameters, predictions derived from inputs, or mathematical derivations. The central claims rest on argumentative analysis of potential failure modes rather than any chain that reduces a result to its own definitions or prior self-citations. No load-bearing steps invoke self-citation for uniqueness theorems, smuggle ansatzes, or rename known results as novel derivations. The analysis is self-contained as a descriptive benchmark proposal and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The six dimensions (diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation) are critical for effective feedback.
invented entities (1)
- explainability pitfalls (no independent evidence)
Reference graph
Works this paper leans on
- [1] Gavin Abercrombie, Alice Curry, Tanvi Dinkar, Verena Rieser, and Zeerak Talat. 2023. Mirages: On Anthropomorphism in Dialogue Systems. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 4776–4790.
- [2]
- [3] Lorin W. Anderson, David R. Krathwohl, Peter W. Airasian, Kathleen A. Cruikshank, Richard E. Mayer, Paul R. Pintrich, James Raths, and Merlin C. Wittrock. 2001. A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom’s Taxonomy of Educational Objectives. Longman, New York, NY.
- [4] Albert Bandura. 1977. Social Learning Theory. Prentice Hall, Englewood Cliffs, NJ.
- [5] Hamsa Bastani, Osbert Bastani, Alp Sungu, Haoyang Ge, Ozge Kabakci, and Rani Mariman. 2024. Generative AI Can Harm Learning. SSRN Working Paper. doi:10.2139/ssrn.4895486.
- [6] Susan M. Brookhart. 2008. How to Give Effective Feedback to Your Students. ASCD, Alexandria, VA.
- [7] James Edgell, Wm. Matthew Kennedy, Isaac Pattis, Ben Knight, Danielle Carvalho, and Elizabeth Wonnacott. 2026. Beyond Accuracy: Towards a Robust Evaluation Methodology for AI Systems for Language Education. arXiv:2603.20088 [cs.CY]. https://arxiv.org/abs/2603.20088
- [8] Upol Ehsan, Samir Passi, Q. Vera Liao, Larry Chan, I-Hsiang Lee, Michael Muller, and Mark O. Riedl. 2024. The Who in XAI: How AI Background Shapes Perceptions of AI Explanations. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. ACM, Article 316. doi:10.1145/3613904.3642474.
- [9]
- [10]
- [11] Iason Gabriel, Andrea Manzini, Geoffrey Keeling, Lisa Anne Hendricks, Verena Rieser, Haroon Iqbal, Nenad Tomašev, Irina Ktena, Zachary Kenton, Manuel Rodriguez, Sam El-Sayed, Sarah Brown, Cansu Akbulut, Andrew Trask, Edward Hughes, Adam S. Bergman, Renee Shelby, Naomi Marchal, Casey Griffin, Juan Mateos-Garcia, Laura Weidinger, William Street, Benjamin La...
- [12] Michael Gerlich. 2025. AI Tools in Society: Impacts on Cognitive Offloading and the Future of Critical Thinking. Societies 15, 1 (2025), 6. doi:10.3390/soc15010006.
- [13] Arthur C. Graesser, Natalie K. Person, and Joseph P. Magliano. 1995. Collaborative Dialogue Patterns in Naturalistic One-to-One Tutoring. Applied Cognitive Psychology 9, 6 (1995), 495–522.
- [14] John Hattie and Helen Timperley. 2007. The Power of Feedback. Review of Educational Research 77, 1 (2007), 81–112.
- [15] Wayne Holmes. 2024. AIED—Coming of Age? International Journal of Artificial Intelligence in Education 34, 1 (2024), 1–11. doi:10.1007/s40593-023-00352-3.
- [16] Wayne Holmes and Fengchun Miao. 2023. Guidance for Generative AI for Education and Research. UNESCO.
- [17] Wayne Holmes and Ilkka Tuomi. 2022. State of the Art and Practice in AI in Education. European Journal of Education 57, 4 (2022), 542–570.
- [18] Fiona Hyland. 1998. The Impact of Teacher Written Feedback on Individual Writers. Journal of Second Language Writing 7, 3 (1998), 255–286.
- [19] Ivan Jurenka, Matthias Kunesch, Kyle R. McKee, Daniel Gillick, et al. 2024. Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach. arXiv:2407.12687.
- [20] Adam Tauman Kalai, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang. 2025. Why Language Models Hallucinate. arXiv:2509.04664.
- [21] Wm. Matthew Kennedy and Daniel Vargas Campos. 2024. Vernacularizing Taxonomies of Harm Is Essential for Operationalizing Holistic AI Safety. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. 698–710.
- [22] Wm. Matthew Kennedy and Daniel Vargas Campos. 2026. A Vernacularized Taxonomy of Harms for AI in Education. In Handbook of Critical Studies in AI for Education, Wayne Holmes and Caroline Pelletier (Eds.). Edward Elgar. Forthcoming.
- [23] Val Klenowski. 2009. Assessment for Learning Revisited: An Asia-Pacific Perspective. Assessment in Education: Principles, Policy & Practice 16, 3 (2009), 263–268.
- [24] Akash Kundu, Adrianna Tan, Theodora Skeadas, Rumman Chowdhury, and Sarah Amos. 2025. Red Teaming for Trust: Evaluating Multicultural and Multilingual AI Systems in Asia-Pacific. In Building Trust Workshop at the International Conference on Learning Representations.
- [25]
- [26]
- [27] LearnLM Team, Google, and Eedi. 2025. AI Tutoring Can Safely and Effectively Support Students: An Exploratory RCT in UK Classrooms. Technical report.
- [28] Danielle S. McNamara, Laura K. Allen, Matthew E. Jacovina, and Aaron D. Likens. 2023. Leveraging Large Language Models for Language Learning. Journal of Learning Analytics 10, 3 (2023), 1–15.
- [29] Allen Nie, Yash Chandak, Miroslav Suzara, Ali Malik, Juliette Woodrow, Matt Peng, Mehran Sahami, Emma Brunskill, and Chris Piech. 2025. The GPT Surprise: Offering Large Language Model Chat in a Massive Coding Class Reduced Engagement but May Increase Adopters’ Exam Performance. In Proceedings of the Twelfth ACM Conference on Learning @ Scale (L@S ’25). Ass...
- [30] Seymour Papert. 1980. Mindstorms: Children, Computers, and Powerful Ideas. Basic Books, New York, NY.
- [31] Alison Pease, Anna Zamansky, and Sarah Wiseman. 2023. Pedagogical Implications of Large Language Models: Challenges and Opportunities. AI & Society (2023), 1–14. doi:10.1007/s00146-023-01753-4.
- [32] Chris Piech, Mehran Sahami, Daphne Koller, Steve Cooper, and Paulo Blikstein. 2015. Modeling How Students Learn to Program. In Proceedings of the 46th ACM Technical Symposium on Computer Science Education. ACM, 153–158. doi:10.1145/2676723.2677308.
- [33] Emanuel A. Schegloff and Harvey Sacks. 1973. Opening Up Closings. Semiotica 8, 4 (1973), 289–327.
- [34] Neil Selwyn. 2019. Should Robots Replace Teachers? AI and the Future of Education. Polity Press, Cambridge, UK.
- [35] Valerie J. Shute. 2008. Focus on Formative Feedback. Review of Educational Research 78, 1 (2008), 153–189.
- [36] Gordon Stobart, Elaine Boyd, Anthony Green, and Therese N. Hopfenbeck. 2019. Effective Feedback: The Key to Successful Assessment for Learning. Oxford University Press.
- [37] Stefano Teso, Oznur Alkan, Wolfgang Stammer, and Elizabeth Daly. 2023. Leveraging Explanations in Interactive Machine Learning: An Overview. Frontiers in Artificial Intelligence 6 (2023). doi:10.3389/frai.2023.1066049.
- [38] Kelsey Urgo, Jaime Arguello, and Robert Capra. 2019. Anderson and Krathwohl’s Two-Dimensional Taxonomy Applied to Task Creation and Learning Assessment. In Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval. ACM, 117–124. doi:10.1145/3341981.3344226.
- [39] Kurt VanLehn. 2011. The Relative Effectiveness of Human Tutoring, Intelligent Tutoring Systems, and Other Tutoring Systems. Educational Psychologist 46, 4 (2011), 197–221. doi:10.1080/00461520.2011.611369.
- [40] Lev S. Vygotsky. 1978. Mind in Society: The Development of Higher Psychological Processes. Harvard University Press, Cambridge, MA.
- [41]
- [42] Dylan Wiliam. 2011. Embedded Formative Assessment. Solution Tree Press, Bloomington, IN.
- [43] Simon Woodhead, Simon Blatchford, and Michael Webb. 2023. Can AI Tutors Improve Learning Outcomes at Scale? Results from a Randomized Controlled Trial. In Proceedings of the International Conference on Learning Analytics and Knowledge. ACM, 489–495.
- [44] Beverly Park Woolf. 2010. Building Intelligent Interactive Tutors: Student-Centered Strategies for Revolutionizing E-Learning. Morgan Kaufmann, Burlington, MA.