Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems

Ben Knight; Danielle Carvalho; Isaac Pattis; James Edgell; Wm. Matthew Kennedy

arxiv: 2604.26145 · v2 · pith:ZX4LCNNBnew · submitted 2026-04-28 · 💻 cs.HC · cs.AI

Pith reviewed 2026-05-25 06:28 UTC · model grok-4.3

keywords explainability pitfalls · AI feedback · language learning · diagnostic accuracy · L2-Bench · learner harms · self-regulation · error causes

The pith

AI feedback for language learners can appear helpful while failing on accuracy, error causes and improvement guidance, creating hidden risks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how AI systems in language education generate explanations that fail across six dimensions of effective feedback drawn from L2-Bench. These failures produce explanations that look useful on the surface but are flawed in diagnostic accuracy, awareness of appropriacy, identification of error causes, prioritisation, guidance for improvement, and support for self-regulation. The authors argue that such failures constitute explainability pitfalls that raise the chance of attainment harms, problematic human-AI interactions, and socioaffective harms. A sympathetic reader would care because millions of learners rely on these tools daily and undetected flaws could reinforce misconceptions over extended periods. The work highlights how language-learning contexts amplify these issues and calls for evaluation frameworks that address them directly.

Core claim

AI systems that supply instant, personalised feedback in language learning can fail on six critical dimensions: diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation. Such failures produce explanations that appear helpful yet are fundamentally flawed. The paper terms these explainability pitfalls and states that they increase the risk of attainment, human-AI interaction, and socioaffective harms, with language-learning settings heightening the exposure.

What carries the argument

The six dimensions of effective feedback from L2-Bench, used to classify AI explanation failures as explainability pitfalls that look sound but undermine learning.
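
As a rough illustration of that classification, the sketch below records a pass/fail verdict for each of the six dimensions and flags an explanation as a pitfall when it looks helpful but fails at least one of them. The dimension names come from the abstract; the `Judgement` record, the pass/fail scale, and the rule that a missing verdict counts as a failure are assumptions made here for illustration, not L2-Bench's actual scoring scheme.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Dimension(Enum):
    # The six feedback dimensions named in the abstract.
    DIAGNOSTIC_ACCURACY = "diagnostic accuracy"
    APPROPRIACY = "awareness of appropriacy"
    ERROR_CAUSES = "causes of error"
    PRIORITISATION = "prioritisation"
    GUIDANCE = "guidance for improvement"
    SELF_REGULATION = "supporting self-regulation"


@dataclass
class Judgement:
    # Hypothetical record: whether the explanation looks helpful at first
    # glance, plus a rater's pass/fail verdict per dimension.
    appears_helpful: bool
    passed: Dict[Dimension, bool]


def failed_dimensions(j: Judgement) -> List[Dimension]:
    # Assumption: a dimension with no recorded verdict counts as failed.
    return [d for d in Dimension if not j.passed.get(d, False)]


def is_pitfall(j: Judgement) -> bool:
    # Looks helpful on the surface, yet fails on at least one dimension.
    return j.appears_helpful and bool(failed_dimensions(j))
```

Under these assumptions, an explanation that reads well but misidentifies the cause of an error is flagged, while one that fails visibly (it does not even appear helpful) is not counted as a pitfall, since a learner is more likely to notice it.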

If this is right

AI developers must evaluate explanations against all six feedback dimensions rather than surface-level helpfulness alone.
Language-learning tools require safeguards that detect and correct failures in error-cause identification and self-regulation support.
Undetected pitfalls can compound over months of daily use, eroding learner outcomes in ways teachers may not notice.
Evaluation frameworks for educational AI should incorporate language-specific risks of socioaffective and interaction harms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Current benchmarks that measure only answer accuracy may miss these explanation pitfalls entirely.
The same failure patterns could appear in AI tutoring systems for other subjects where explanatory feedback is central.
Designers might reduce risks by building explicit checks for each of the six dimensions into model training or post-generation review.
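
One way the last suggestion could look in practice is a post-generation review step that refuses to run unless every dimension has a check registered. This is a minimal sketch under stated assumptions: the `review` function, the `Check` signature, and the idea of one boolean check per dimension are hypothetical, and the checks themselves (human raters, rubric-based graders, or anything else) are left to the caller.

```python
from typing import Callable, Dict, List

# The six dimensions as listed in the abstract.
DIMENSIONS = (
    "diagnostic accuracy",
    "awareness of appropriacy",
    "causes of error",
    "prioritisation",
    "guidance for improvement",
    "supporting self-regulation",
)

# Hypothetical check signature: (learner_text, explanation) -> passes?
Check = Callable[[str, str], bool]


def review(learner_text: str, explanation: str, checks: Dict[str, Check]) -> List[str]:
    """Return the dimensions on which the explanation fails."""
    missing = [d for d in DIMENSIONS if d not in checks]
    if missing:
        # Refuse to approve anything when a dimension has no check at all.
        raise ValueError(f"no check registered for: {missing}")
    return [d for d in DIMENSIONS if not checks[d](learner_text, explanation)]
```

An empty result would mean the explanation passed every registered check. Raising on a missing check, rather than skipping it, is a deliberate choice in this sketch: a silently absent check would reproduce the surface-level evaluation the summary above warns against.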

Load-bearing premise

That the six listed dimensions capture the critical aspects of effective feedback and that the described failures reliably produce the claimed harms in language learning contexts.

What would settle it

A longitudinal study of language learners using AI feedback that tracks whether those receiving the identified failure types develop more persistent misconceptions or slower progress than learners receiving accurate feedback on the same dimensions.
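
If such a study produced one outcome number per learner (for example, a count of misconceptions still present at the end), the two groups could be compared with a simple permutation test. The sketch below is one possible analysis, not a method the paper proposes; the outcome measure, the function name, and the one-sided framing are all assumptions.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_p_value(
    flawed: Sequence[float],
    accurate: Sequence[float],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """One-sided p-value for mean(flawed) > mean(accurate) under label shuffling."""
    rng = random.Random(seed)
    observed = mean(flawed) - mean(accurate)
    pooled = list(flawed) + list(accurate)
    k = len(flawed)
    hits = 0
    for _ in range(n_resamples):
        # Reassign group labels at random and recompute the difference.
        rng.shuffle(pooled)
        if mean(pooled[:k]) - mean(pooled[k:]) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)
```

A small p-value would suggest that learners who received the flawed feedback types ended up with more persistent misconceptions than chance label assignment would explain. A real study would also need to account for repeated measures and learner background, which this sketch ignores.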

original abstract

AI-powered language learning tools increasingly provide instant, personalised feedback to millions of learners worldwide. However, this feedback can fail in ways that are difficult for learners--and even teachers--to detect, potentially reinforcing misconceptions and eroding learning outcomes over extended use. We present a portion of L2-Bench, a benchmark for evaluating AI systems in language education that includes (but is not limited to) six critical dimensions of effective feedback: diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation. We analyse how AI systems can fail with respect to these dimensions. These failures, which we argue are conducive to "explainability pitfalls," are AI-generated explanations that appear helpful on the surface but are fundamentally flawed, increasing the risk of attainment, human-AI interaction, and socioaffective harms. We discuss how the specific context of language learning amplifies these risks and outline open questions we believe merit more attention when designing evaluation frameworks specifically. Our analysis aims to expand the community's understanding of both the typology of explainability pitfalls and the contextual dynamics in which they may occur in order to encourage AI developers to better design safe, trustworthy, and effective AI explanations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a portion of L2-Bench, a benchmark for AI systems in language education, centered on six dimensions of effective feedback (diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation). It analyzes how AI-generated explanations can fail on these dimensions, arguing that such failures constitute 'explainability pitfalls'—explanations that appear helpful but are flawed—and that these increase risks of attainment, human-AI interaction, and socioaffective harms, with language-learning contexts amplifying the issues. The paper discusses these dynamics and outlines open questions for designing evaluation frameworks.

Significance. If the proposed typology of explainability pitfalls proves robust and the contextual risks are borne out, the work could usefully inform safer AI design in educational tools by highlighting failure modes specific to language learning. The framing as analysis plus open questions is a constructive contribution that may stimulate targeted follow-up research on evaluation methods.

major comments (2)

[Abstract] The central claim that the six dimensions capture critical aspects of effective feedback and that the described AI failures reliably produce the listed harms (attainment, human-AI interaction, socioaffective) is presented as an argument without any concrete examples, operationalization details from L2-Bench, or supporting analysis; this premise is load-bearing for the typology and the call for new evaluation frameworks.
[Discussion of contextual dynamics] The manuscript asserts that language-learning contexts amplify the risks of these pitfalls, yet provides no comparative discussion or references to prior work on feedback effectiveness in SLA (second language acquisition) that would ground why the six dimensions are exhaustive or why the harms follow in this domain specifically.

minor comments (2)

The term 'attainment' harms is used without definition; a brief clarification of what is meant by this category would improve readability.
The abstract states the benchmark 'includes (but is not limited to)' the six dimensions; if the full manuscript expands on additional dimensions or the benchmark structure, a short summary table would aid clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which highlight opportunities to strengthen the manuscript's clarity and grounding. We address each major comment point by point below.

point-by-point responses

Referee: [Abstract] The central claim that the six dimensions capture critical aspects of effective feedback and that the described AI failures reliably produce the listed harms (attainment, human-AI interaction, socioaffective) is presented as an argument without any concrete examples, operationalization details from L2-Bench, or supporting analysis; this premise is load-bearing for the typology and the call for new evaluation frameworks.

Authors: We agree that the abstract, constrained by length, presents the central claims at a high level without concrete examples or L2-Bench operationalization details. The body of the manuscript provides the supporting analysis, typology, and benchmark details. To address the concern that these claims are load-bearing, we will revise the abstract to incorporate brief illustrative examples of the pitfalls and their associated harms, while retaining its summary nature. This change will better foreground the empirical basis for the typology.
revision: yes

Referee: [Discussion of contextual dynamics] The manuscript asserts that language-learning contexts amplify the risks of these pitfalls, yet provides no comparative discussion or references to prior work on feedback effectiveness in SLA (second language acquisition) that would ground why the six dimensions are exhaustive or why the harms follow in this domain specifically.

Authors: We acknowledge the validity of this observation. The manuscript discusses amplification based on domain-specific characteristics of language learning but does not include explicit comparative analysis or citations to SLA literature on feedback effectiveness. In revision, we will expand the relevant discussion section to incorporate key references from SLA research on corrective feedback and self-regulation, and to articulate why the six dimensions are particularly salient and why the identified harms are amplified in this context relative to other educational domains. We do not claim the dimensions are exhaustive, only critical; the revision will clarify this framing.
revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents a qualitative typology of explanation failures in AI language-learning tools, drawing on six dimensions from L2-Bench and discussing associated risks. No equations, derivations, fitted parameters, predictions, or self-referential claims appear in the abstract or described structure. The work frames its contribution as analysis and open questions rather than any result obtained by construction from its own inputs. No load-bearing self-citations or ansatzes are referenced in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The contribution rests on introducing a new benchmark name and conceptual category of explainability pitfalls without external benchmarks, data, or prior evidence referenced in the abstract.

invented entities (1)

explainability pitfalls · no independent evidence
purpose: To label and categorize AI-generated explanations in language learning that appear helpful but are flawed
New term introduced in the abstract to frame the analysis of failures.

pith-pipeline@v0.9.0 · 5757 in / 1099 out tokens · 29087 ms · 2026-05-25T06:28:25.868950+00:00 · methodology

