Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems

Ben Knight; Danielle Carvalho; Isaac Pattis; James Edgell; Wm. Matthew Kennedy

arxiv: 2604.26145 · v2 · pith:ZX4LCNNBnew · submitted 2026-04-28 · 💻 cs.HC · cs.AI

Pith reviewed 2026-05-07 15:05 UTC · model grok-4.3

keywords: AI explainability, language learning, educational feedback, explanation failures, human-AI interaction, learner harms, educational AI, L2-Bench

The pith

AI explanations in language learning tools often look helpful but contain flaws that can reinforce errors and erode trust.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how AI-powered language learning systems generate feedback that fails in ways learners and teachers struggle to spot. It introduces six dimensions of effective feedback drawn from the L2-Bench benchmark: diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation. Failures on these dimensions produce what the authors term explainability pitfalls, meaning explanations that appear useful on the surface yet rest on incorrect or incomplete reasoning. If the analysis holds, prolonged use of such tools risks leaving learners with reinforced misconceptions, weaker outcomes, and damaged confidence. The work highlights how the personal and ongoing nature of language learning makes these issues especially damaging and urges better evaluation methods for educational AI.

Core claim

AI systems providing language feedback can fail across the six dimensions of diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation. These failures create explainability pitfalls: AI-generated explanations that appear helpful on the surface but are fundamentally flawed. In the language-learning setting such pitfalls raise the likelihood of attainment harms, human-AI interaction harms, and socioaffective harms, because learners may not detect the problems and teachers may not either. The paper maps concrete failure modes on each dimension and argues that the sustained, personal character of language study amplifies these risks.

What carries the argument

Explainability pitfalls, defined as AI-generated explanations that appear helpful on the surface but are fundamentally flawed when evaluated against the six dimensions of effective language feedback.

If this is right

Learners can internalize incorrect rules or patterns without realizing the AI feedback is wrong.
Teachers may overlook the flaws when reviewing AI-generated responses.
Extended use of the tools can gradually worsen overall language proficiency.
The personal and repeated nature of language practice amplifies risks of reduced learner confidence and motivation.
Evaluation frameworks for AI explanations must incorporate domain-specific checks for these failure modes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same pattern of surface-plausible but flawed explanations likely appears in AI tools for other school subjects.
Developers could add automated checks against the six dimensions to reduce the incidence of these pitfalls.
Controlled experiments with actual language learners would provide direct evidence on whether the pitfalls translate into measurable learning losses.
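
As a rough illustration of the automated-check idea above, the sketch below records a per-dimension judgement for one AI explanation and flags the combination the paper calls a pitfall: helpful-looking, yet failing at least one dimension. The paper does not publish a schema or scoring rule, so the names `Dimension`, `FeedbackReview`, `looks_helpful`, and `is_pitfall` are hypothetical, and the pass/fail judgements are assumed to come from a human reviewer or some upstream checker.

```python
from dataclasses import dataclass, field
from enum import Enum


class Dimension(Enum):
    """The six feedback dimensions named in the abstract."""
    DIAGNOSTIC_ACCURACY = "diagnostic accuracy"
    AWARENESS_OF_APPROPRIACY = "awareness of appropriacy"
    CAUSES_OF_ERROR = "causes of error"
    PRIORITISATION = "prioritisation"
    GUIDANCE_FOR_IMPROVEMENT = "guidance for improvement"
    SUPPORTING_SELF_REGULATION = "supporting self-regulation"


@dataclass
class FeedbackReview:
    """One judgement of a single AI explanation (hypothetical schema)."""
    looks_helpful: bool  # surface impression only, e.g. fluent and confident
    passed: dict = field(default_factory=dict)  # maps Dimension -> bool

    def failed_dimensions(self):
        # A dimension with no recorded judgement counts as failed,
        # so an incomplete review can never pass silently.
        return [d for d in Dimension if not self.passed.get(d, False)]

    def is_pitfall(self):
        # Pitfall in the paper's sense: appears helpful, yet is flawed.
        return self.looks_helpful and bool(self.failed_dimensions())


# Usage: an explanation that reads well but misattributes the error's cause.
review = FeedbackReview(looks_helpful=True,
                        passed={d: True for d in Dimension})
review.passed[Dimension.CAUSES_OF_ERROR] = False
flagged = review.is_pitfall()
```

Treating a missing judgement as a failure is a deliberate design choice in this sketch; a real checker might instead report unreviewed dimensions separately.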

Load-bearing premise

The six dimensions fully capture the critical failure modes of AI feedback and these flawed explanations actually produce the claimed harms during real learner interactions.

What would settle it

A longitudinal study of language learners that tracks error persistence and motivation over months and finds no measurable difference between users of standard AI feedback and users of feedback known to fail on the six dimensions.
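
One way such a comparison might be analysed is a permutation test on a per-learner outcome such as error-persistence rate. The sketch below is an assumption about the analysis, not something the paper or this review specifies; the function name `permutation_p_value` and its parameters are hypothetical, and no real learner data is implied.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)  # seeded so repeated runs agree
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_resamples):
        # Reassign learners to the two groups at random.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_resamples + 1)
```

A large p-value from a test like this would not by itself show that the two groups are equivalent; supporting a "no measurable difference" conclusion would need an equivalence-style analysis with a pre-specified margin.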

Original abstract

AI-powered language learning tools increasingly provide instant, personalised feedback to millions of learners worldwide. However, this feedback can fail in ways that are difficult for learners--and even teachers--to detect, potentially reinforcing misconceptions and eroding learning outcomes over extended use. We present a portion of L2-Bench, a benchmark for evaluating AI systems in language education that includes (but is not limited to) six critical dimensions of effective feedback: diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation. We analyse how AI systems can fail with respect to these dimensions. These failures, which we argue are conducive to "explainability pitfalls," are AI-generated explanations that appear helpful on the surface but are fundamentally flawed, increasing the risk of attainment, human-AI interaction, and socioaffective harms. We discuss how the specific context of language learning amplifies these risks and outline open questions we believe merit more attention when designing evaluation frameworks specifically. Our analysis aims to expand the community's understanding of both the typology of explainability pitfalls and the contextual dynamics in which they may occur in order to encourage AI developers to better design safe, trustworthy, and effective AI explanations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper names six practical dimensions for spotting weak feedback in AI language tools but asserts harms from those failures without any examples, data, or causal evidence.

The main takeaway is that this paper flags risks in AI-generated feedback for language learners but does so through a conceptual typology rather than any tested cases or outcomes. It introduces a slice of L2-Bench built around six dimensions (diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and self-regulation) and argues that failures here create explainability pitfalls that look helpful but can lead to attainment, interaction, or socioaffective problems. The language-learning context is noted as especially sensitive because learners may not catch the flaws themselves.

That framing is straightforward and points to a real design issue for tools used by millions. The dimensions themselves feel grounded in how feedback actually works in second-language settings, which gives the list some immediate utility for developers checking their systems. The authors also leave open questions about evaluation frameworks, which keeps the piece from feeling closed off.

The soft spot is the complete absence of grounding. No sample AI responses are dissected, no learner data or proxy measures appear, and the path from a flawed explanation to measurable harm is stated rather than traced. Without that step the central warning stays hypothetical.

This is the sort of paper that could interest people working on educational AI or HCI applications in language tech. A reader already deep in XAI literature will see familiar concerns applied to a new domain, while someone building or evaluating tutors might borrow the dimensions as a quick checklist. It deserves a serious referee because the topic is timely and the dimensions are concrete enough to build on, even if the current version needs examples and at least preliminary validation to carry weight. I would send it out with a request for those additions rather than desk-reject it outright.

Referee Report

3 major / 1 minor

Summary. The manuscript presents a portion of L2-Bench, a benchmark for evaluating AI systems in language education, organized around six dimensions of effective feedback (diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation). It analyzes how AI-generated explanations can fail on these dimensions, framing such failures as 'explainability pitfalls'—superficially helpful but fundamentally flawed outputs—and argues that these increase risks of attainment, human-AI interaction, and socioaffective harms. The paper discusses how language-learning contexts amplify these risks and outlines open questions for designing evaluation frameworks.

Significance. If the typology of pitfalls is later validated with empirical data and the claimed causal pathways to learner harms are demonstrated, the work could help guide safer design of personalized feedback tools used by millions of language learners, expanding the community's understanding of undetectable explanation failures in educational AI.

major comments (3)

[Abstract] Abstract: The central claim that failures on the six dimensions produce explainability pitfalls that increase attainment, human-AI interaction, and socioaffective harms is asserted without any concrete examples, benchmark data, learner studies, or causal mechanisms, leaving the argument as a conceptual typology rather than an evidence-based analysis.
[L2-Bench description] L2-Bench presentation: Although the manuscript states that it presents a portion of L2-Bench, no specific benchmark items, evaluation protocols, AI output examples, or failure instances on the listed dimensions are supplied, which is required to make the analysis of AI failures operational and testable.
[Discussion of harms] Harms discussion: The three harm categories lack operational definitions, proxies, or any linkage to measurable outcomes; the manuscript provides no evidence that surface-plausible but incorrect feedback on the six dimensions actually produces the claimed negative effects in real learner interactions.

minor comments (1)

[Abstract] Abstract: The list of harms ('attainment, human-AI interaction, and socioaffective harms') would benefit from explicit labeling as three distinct categories to avoid ambiguity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. Our manuscript is a conceptual contribution that proposes a typology of explainability pitfalls and outlines dimensions for L2-Bench, rather than an empirical validation study. We address each major comment below and will revise the paper accordingly to improve clarity and concreteness.

Referee: [Abstract] Abstract: The central claim that failures on the six dimensions produce explainability pitfalls that increase attainment, human-AI interaction, and socioaffective harms is asserted without any concrete examples, benchmark data, learner studies, or causal mechanisms, leaving the argument as a conceptual typology rather than an evidence-based analysis.

Authors: We agree that the abstract presents the claims at a high level. The manuscript develops a typology through logical analysis of how failures on the six dimensions can produce superficially plausible but flawed explanations, with risks argued via pathways drawn from second-language acquisition and AI ethics literature. No new empirical data or causal studies are included because the paper's aim is to identify the typology and open questions to guide future work. We will revise the abstract to explicitly note its conceptual scope and add brief illustrative examples of AI explanation failures in the main text.
revision: partial

Referee: [L2-Bench description] L2-Bench presentation: Although the manuscript states that it presents a portion of L2-Bench, no specific benchmark items, evaluation protocols, AI output examples, or failure instances on the listed dimensions are supplied, which is required to make the analysis of AI failures operational and testable.

Authors: The manuscript introduces the six dimensions and discusses potential failure modes at the framework level. Specific benchmark items, protocols, and instantiated examples are part of the full L2-Bench development, planned for separate release. This paper focuses on the conceptual structure and pitfalls. We will add high-level evaluation protocol descriptions and concrete examples of AI outputs and failures for each dimension in the revised version to make the analysis more operational.
revision: yes

Referee: [Discussion of harms] Harms discussion: The three harm categories lack operational definitions, proxies, or any linkage to measurable outcomes; the manuscript provides no evidence that surface-plausible but incorrect feedback on the six dimensions actually produces the claimed negative effects in real learner interactions.

Authors: We acknowledge that the harms section is high-level. The three categories are hypothesized risks drawn from existing literature on educational AI and language learning, without new empirical demonstration of causality in this conceptual paper. In revision, we will add operational definitions, cite relevant proxies and studies for linkage to measurable outcomes, and clarify that the pathways are proposed to motivate future empirical work rather than asserted as proven.
revision: partial

Circularity Check

0 steps flagged

No circularity: purely descriptive typology with no derivations or self-referential reductions

The paper presents a conceptual framework and typology of explainability pitfalls in AI language learning feedback, organized around six dimensions of effective feedback. It contains no equations, fitted parameters, predictions derived from inputs, or mathematical derivations. The central claims rest on argumentative analysis of potential failure modes rather than any chain that reduces a result to its own definitions or prior self-citations. No load-bearing steps invoke self-citation for uniqueness theorems, smuggle ansatzes, or rename known results as novel derivations. The analysis is self-contained as a descriptive benchmark proposal and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on domain assumptions about what constitutes effective feedback and the existence of harms from flawed explanations, without independent evidence or prior citations supplied in the abstract.

axioms (1)

domain assumption: The six dimensions (diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation) are critical for effective feedback.
Presented as the basis for the L2-Bench benchmark in the abstract.

invented entities (1)

explainability pitfalls (no independent evidence)
purpose: To categorize AI explanations that appear helpful but are flawed in language learning contexts.
New framing introduced to describe the failure modes and their risks.

pith-pipeline@v0.9.0 · 5519 in / 1315 out tokens · 61129 ms · 2026-05-07T15:05:15.251367+00:00 · methodology
