Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems
Pith reviewed 2026-05-25 06:28 UTC · model grok-4.3
The pith
AI feedback for language learners can appear helpful while failing on accuracy, error causes and improvement guidance, creating hidden risks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AI systems supplying instant personalised feedback in language learning can fail on six critical dimensions (diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation), producing explanations that appear helpful yet are fundamentally flawed. The paper terms these explainability pitfalls and states that they increase the risk of attainment, human-AI interaction, and socioaffective harms, with language-learning settings heightening the exposure.
What carries the argument
The six dimensions of effective feedback from L2-Bench, used to classify AI explanation failures as explainability pitfalls that look sound but undermine learning.
If this is right
- AI developers must evaluate explanations against all six feedback dimensions rather than surface-level helpfulness alone.
- Language-learning tools require safeguards that detect and correct failures in error-cause identification and self-regulation support.
- Undetected pitfalls can compound over months of daily use, eroding learner outcomes in ways teachers may not notice.
- Evaluation frameworks for educational AI should incorporate language-specific risks of socioaffective and interaction harms.
Where Pith is reading between the lines
- Current benchmarks that only measure accuracy of answers may miss these explanation pitfalls entirely.
- The same failure patterns could appear in AI tutoring systems for other subjects where explanatory feedback is central.
- Designers might reduce risks by building explicit checks for each of the six dimensions into model training or post-generation review.
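One way to picture the per-dimension check suggested in the last bullet is a small rubric record filled in by a reviewer (human or automated) after an explanation is generated. The sketch below is illustrative only: the six dimension names come from the abstract, while the class name `FeedbackReview`, the 0 to 2 rating scale, the threshold, and the `looks_helpful` flag are assumptions made for this example and are not taken from L2-Bench.

```python
from dataclasses import dataclass

# Dimension names follow the abstract; everything else in this block
# (class name, 0-2 scale, threshold) is a hypothetical illustration.
DIMENSIONS = (
    "diagnostic_accuracy",
    "awareness_of_appropriacy",
    "causes_of_error",
    "prioritisation",
    "guidance_for_improvement",
    "supporting_self_regulation",
)


@dataclass
class FeedbackReview:
    """Reviewer ratings for one AI explanation, one rating per dimension."""

    ratings: dict[str, int]  # assumed scale: 0 = fails, 1 = partial, 2 = meets

    def __post_init__(self) -> None:
        # Require exactly the six dimensions so no dimension is silently skipped.
        missing = set(DIMENSIONS) - set(self.ratings)
        unknown = set(self.ratings) - set(DIMENSIONS)
        if missing or unknown:
            raise ValueError(f"missing={sorted(missing)} unknown={sorted(unknown)}")

    def failed_dimensions(self, threshold: int = 1) -> list[str]:
        """Return dimensions rated below the threshold, in canonical order."""
        return [d for d in DIMENSIONS if self.ratings[d] < threshold]


def is_potential_pitfall(review: FeedbackReview, looks_helpful: bool) -> bool:
    """Flag explanations that seem helpful on the surface yet fail a dimension."""
    return looks_helpful and bool(review.failed_dimensions())


if __name__ == "__main__":
    # An explanation that reads well but never identifies why the error occurred.
    review = FeedbackReview({d: 2 for d in DIMENSIONS} | {"causes_of_error": 0})
    print(review.failed_dimensions())
    print(is_potential_pitfall(review, looks_helpful=True))
```

Requiring a rating for every dimension mirrors the point that surface-level helpfulness alone is not enough: an explanation is only cleared when none of the six dimensions falls below the chosen threshold.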
Load-bearing premise
That the six listed dimensions capture the critical aspects of effective feedback and that the described failures reliably produce the claimed harms in language learning contexts.
What would settle it
A longitudinal study of language learners using AI feedback that tracks whether those receiving the identified failure types develop more persistent misconceptions or slower progress than learners receiving accurate feedback on the same dimensions.
Original abstract
AI-powered language learning tools increasingly provide instant, personalised feedback to millions of learners worldwide. However, this feedback can fail in ways that are difficult for learners--and even teachers--to detect, potentially reinforcing misconceptions and eroding learning outcomes over extended use. We present a portion of L2-Bench, a benchmark for evaluating AI systems in language education that includes (but is not limited to) six critical dimensions of effective feedback: diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation. We analyse how AI systems can fail with respect to these dimensions. These failures, which we argue are conducive to "explainability pitfalls," are AI-generated explanations that appear helpful on the surface but are fundamentally flawed, increasing the risk of attainment, human-AI interaction, and socioaffective harms. We discuss how the specific context of language learning amplifies these risks and outline open questions we believe merit more attention when designing evaluation frameworks specifically. Our analysis aims to expand the community's understanding of both the typology of explainability pitfalls and the contextual dynamics in which they may occur in order to encourage AI developers to better design safe, trustworthy, and effective AI explanations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a portion of L2-Bench, a benchmark for AI systems in language education, centered on six dimensions of effective feedback (diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation). It analyzes how AI-generated explanations can fail on these dimensions, arguing that such failures constitute 'explainability pitfalls'—explanations that appear helpful but are flawed—and that these increase risks of attainment, human-AI interaction, and socioaffective harms, with language-learning contexts amplifying the issues. The paper discusses these dynamics and outlines open questions for designing evaluation frameworks.
Significance. If the proposed typology of explainability pitfalls proves robust and the contextual risks are borne out, the work could usefully inform safer AI design in educational tools by highlighting failure modes specific to language learning. The framing as analysis plus open questions is a constructive contribution that may stimulate targeted follow-up research on evaluation methods.
major comments (2)
- [Abstract] The central claim that the six dimensions capture critical aspects of effective feedback and that the described AI failures reliably produce the listed harms (attainment, human-AI interaction, socioaffective) is presented as an argument without any concrete examples, operationalization details from L2-Bench, or supporting analysis; this premise is load-bearing for the typology and the call for new evaluation frameworks.
- [Discussion of contextual dynamics] The manuscript asserts that language-learning contexts amplify the risks of these pitfalls, yet provides no comparative discussion or references to prior work on feedback effectiveness in SLA (second language acquisition) that would ground why the six dimensions are exhaustive or why the harms follow in this domain specifically.
minor comments (2)
- The term 'attainment harms' is used without definition; a brief clarification of what this category covers would improve readability.
- The abstract states the benchmark 'includes (but is not limited to)' the six dimensions; if the full manuscript expands on additional dimensions or the benchmark structure, a short summary table would aid clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments, which highlight opportunities to strengthen the manuscript's clarity and grounding. We address each major comment point by point below.
Point-by-point responses
- Referee: [Abstract] The central claim that the six dimensions capture critical aspects of effective feedback and that the described AI failures reliably produce the listed harms (attainment, human-AI interaction, socioaffective) is presented as an argument without any concrete examples, operationalization details from L2-Bench, or supporting analysis; this premise is load-bearing for the typology and the call for new evaluation frameworks.
Authors: We agree that the abstract, constrained by length, presents the central claims at a high level without concrete examples or L2-Bench operationalization details. The body of the manuscript provides the supporting analysis, typology, and benchmark details. To address the concern that these claims are load-bearing, we will revise the abstract to incorporate brief illustrative examples of the pitfalls and their associated harms, while retaining its summary nature. This change will better foreground the empirical basis for the typology.
Revision: yes
- Referee: [Discussion of contextual dynamics] The manuscript asserts that language-learning contexts amplify the risks of these pitfalls, yet provides no comparative discussion or references to prior work on feedback effectiveness in SLA (second language acquisition) that would ground why the six dimensions are exhaustive or why the harms follow in this domain specifically.
Authors: We acknowledge the validity of this observation. The manuscript discusses amplification based on domain-specific characteristics of language learning but does not include explicit comparative analysis or citations to SLA literature on feedback effectiveness. In revision, we will expand the relevant discussion section to incorporate key references from SLA research on corrective feedback and self-regulation, and to articulate why the six dimensions are particularly salient and why the identified harms are amplified in this context relative to other educational domains. We do not claim the dimensions are exhaustive, only critical; the revision will clarify this framing.
Revision: yes
Circularity Check
No significant circularity
The paper presents a qualitative typology of explanation failures in AI language-learning tools, drawing on six dimensions from L2-Bench and discussing associated risks. No equations, derivations, fitted parameters, predictions, or self-referential claims appear in the abstract or described structure. The work frames its contribution as analysis and open questions rather than any result obtained by construction from its own inputs. No load-bearing self-citations or ansatzes are referenced in the provided text.
Axiom & Free-Parameter Ledger
invented entities (1)
- explainability pitfalls: no independent evidence