arxiv: 2604.09619 · v1 · submitted 2026-03-17 · 💻 cs.CY · cs.AI · cs.CL

Assessing the Pedagogical Readiness of Large Language Models as AI Tutors in Low-Resource Contexts: A Case Study of Nepal's K-10 Curriculum

Pith reviewed 2026-05-15 10:33 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · cs.CL
keywords Large Language Models, AI Tutors, Pedagogical Evaluation, Low-Resource Education, Nepal Curriculum, Curriculum Alignment, Educational Technology, Cultural Contextualization

The pith

Off-the-shelf LLMs are not ready for autonomous use as tutors in Nepalese classrooms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests four leading large language models as potential AI tutors for Nepal's grades 5-10 science and mathematics curriculum. It builds a custom benchmark and scores the models on seven aspects of teaching quality. The results show strong factual performance overall but persistent shortfalls in explaining ideas accessibly to beginners and in using examples that fit Nepalese culture. A reader would care because many hope AI can expand tutoring access in low-resource settings, yet this evaluation indicates current models would leave students with confusing or mismatched support without human involvement.

Core claim

The evaluation identifies a curriculum-alignment gap: frontier models (GPT-4o, Claude Sonnet 4) reach approximately 97 percent aggregate reliability yet show clear weaknesses in pedagogical clarity and cultural contextualization. Two recurring failure modes appear: the Expert's Curse, in which models solve advanced problems but cannot explain them simply to novices, and the Foundational Fallacy, in which performance drops on easier lower-grade material because the models do not adjust to younger learners' needs. Regional models such as Kimi K2 additionally display a Contextual Blindspot, omitting culturally relevant examples in more than 20 percent of interactions. These patterns lead the authors to conclude that off-the-shelf LLMs cannot yet be deployed autonomously in Nepalese classrooms.

What carries the argument

A curriculum-aligned benchmark that scores responses on seven binary metrics: Prompt Alignment, Factual Correctness, Clarity, Contextual Relevance, Engagement, Harmful Content Avoidance, and Solution Accuracy.
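
As a rough illustration of how seven binary metrics could roll up into a single score, the sketch below assumes that each response receives a pass/fail mark per metric and that aggregate reliability is the fraction of all marks that pass. The abstract does not state the exact aggregation formula, so `MetricScores` and `aggregate_reliability` are hypothetical names and the sample data is invented.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MetricScores:
    """One response scored on the seven binary metrics named in the paper."""
    prompt_alignment: bool
    factual_correctness: bool
    clarity: bool
    contextual_relevance: bool
    engagement: bool
    harmful_content_avoidance: bool
    solution_accuracy: bool

    def passed(self) -> int:
        # Count how many of the seven checks this response passed.
        return sum(getattr(self, f.name) for f in fields(self))


def aggregate_reliability(scores: list[MetricScores]) -> float:
    """Assumed definition: share of all (response, metric) checks that pass."""
    if not scores:
        raise ValueError("no scored responses")
    total_checks = len(scores) * len(fields(MetricScores))
    return sum(s.passed() for s in scores) / total_checks


# Invented example: two responses, one failing Clarity and Contextual Relevance.
example = [
    MetricScores(True, True, True, True, True, True, True),
    MetricScores(True, True, False, False, True, True, True),
]
example_reliability = aggregate_reliability(example)  # 12 of 14 checks pass
```

Under this assumed definition a model can post a high aggregate score while still failing Clarity or Contextual Relevance on a sizeable share of responses, which is why per-metric rates matter alongside the headline figure.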

If this is right

  • Models require human-in-the-loop oversight rather than independent classroom deployment.
  • Curriculum-specific fine-tuning offers a route to close the identified alignment gaps.
  • Performance on simpler foundational material must improve to support younger students.
  • Cultural adaptation is essential for any regional model to avoid contextual blindspots.
  • The same evaluation approach can be applied to other local curricula to check readiness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same clarity and cultural-fit problems likely appear when general LLMs are applied to other non-Western school systems.
  • Adding direct feedback from Nepalese teachers into the scoring process could strengthen the benchmark's validity.
  • Fine-tuning the models on Nepalese curriculum examples and local contexts would probably raise the contextual relevance scores.
  • The age-related performance drop suggests AI tutors need explicit mechanisms to scale explanations to different grade levels.

Load-bearing premise

The seven binary metrics and the new curriculum-aligned benchmark accurately reflect real pedagogical effectiveness and cultural relevance in Nepalese classrooms without direct teacher validation or student outcome measurements.

What would settle it

A classroom trial that tracks actual student test scores and engagement levels when using the evaluated LLMs versus human teachers in Nepalese schools would show whether the reported clarity and contextual gaps produce measurable differences in learning.

Figures

Figures reproduced from arXiv: 2604.09619 by Isha Sharma Gauli, Kiran Parajuli, Prasansha Bharati, Pratyush Acharya, Yokibha Chapagain.

Figure 1: GPT-4o Metric Profile
Figure 3: Qwen3 Metric Profile

Original abstract

The integration of Large Language Models (LLMs) into educational ecosystems promises to democratize access to personalized tutoring, yet the readiness of these systems for deployment in non-Western, low-resource contexts remains critically under-examined. This study presents a systematic evaluation of four state-of-the-art LLMs--GPT-4o, Claude Sonnet 4, Qwen3-235B, and Kimi K2--assessing their capacity to function as AI tutors within the specific curricular and cultural framework of Nepal's Grade 5-10 Science and Mathematics education. We introduce a novel, curriculum-aligned benchmark and a fine-grained evaluation framework inspired by the "natural language unit tests" paradigm, decomposing pedagogical efficacy into seven binary metrics: Prompt Alignment, Factual Correctness, Clarity, Contextual Relevance, Engagement, Harmful Content Avoidance, and Solution Accuracy. Our results reveal a stark "curriculum-alignment gap." While frontier models (GPT-4o, Claude Sonnet 4) achieve high aggregate reliability (approximately 97%), significant deficiencies persist in pedagogical clarity and cultural contextualization. We identify two pervasive failure modes: the "Expert's Curse," where models solve complex problems but fail to explain them clearly to novices, and the "Foundational Fallacy," where performance paradoxically degrades on simpler, lower-grade material due to an inability to adapt to younger learners' cognitive constraints. Furthermore, regional models like Kimi K2 exhibit a "Contextual Blindspot," failing to provide culturally relevant examples in over 20% of interactions. These findings suggest that off-the-shelf LLMs are not yet ready for autonomous deployment in Nepalese classrooms. We propose a "human-in-the-loop" deployment strategy and offer a methodological blueprint for curriculum-specific fine-tuning to align global AI capabilities with local educational needs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that off-the-shelf LLMs are not ready for autonomous deployment as AI tutors in Nepal's K-10 Science and Mathematics curriculum. It introduces a curriculum-aligned benchmark with seven binary metrics (Prompt Alignment, Factual Correctness, Clarity, Contextual Relevance, Engagement, Harmful Content Avoidance, Solution Accuracy) evaluated on four models (GPT-4o, Claude Sonnet 4, Qwen3-235B, Kimi K2), identifying failure modes such as the 'Expert's Curse' and 'Foundational Fallacy', with frontier models achieving ~97% aggregate reliability but showing deficiencies in clarity and contextualization.

Significance. If the introduced benchmark and metrics are validated against real-world pedagogical outcomes, this work would significantly contribute to understanding LLM limitations in low-resource, culturally specific educational settings, supporting the need for human-in-the-loop approaches and localized fine-tuning. The focus on Nepal's curriculum fills an important gap in AI education research.

major comments (3)
  1. Abstract: The reported aggregate reliability of approximately 97% for frontier models lacks accompanying sample sizes, number of evaluated interactions, inter-rater agreement details, statistical tests, or baseline comparisons, undermining the ability to assess the reliability of the central findings on curriculum-alignment gaps.
  2. Evaluation Framework (and Abstract): The seven binary metrics and the curriculum-aligned benchmark are presented without evidence of validation by Nepalese teachers, alignment with local pedagogical standards, or correlation with actual student learning gains, which is critical since the claim that LLMs are not ready for deployment depends on these metrics accurately capturing pedagogical effectiveness.
  3. Results section: The identification of failure modes like 'Expert's Curse' and 'Foundational Fallacy' is based on the unvalidated metrics; without external validation or student outcome data, these may reflect artifacts of the evaluation framework rather than genuine pedagogical shortcomings.
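
To illustrate why sample size matters for the first major comment, the sketch below computes a Wilson score interval for an observed pass rate. The counts are hypothetical, chosen only so that both cases give 97 percent; the number of interactions actually evaluated is not stated in the abstract.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0 or not 0 <= passes <= n:
        raise ValueError("need n > 0 and 0 <= passes <= n")
    p_hat = passes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical counts: the same 97% pass rate at two sample sizes.
small = wilson_interval(97, 100)
large = wilson_interval(970, 1000)
small_width = small[1] - small[0]
large_width = large[1] - large[0]
```

The interval for the smaller sample is several times wider, so the same headline percentage supports a much weaker claim when few interactions were scored.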
minor comments (2)
  1. Abstract: The term 'natural language unit tests' paradigm is introduced but not clearly defined or referenced to prior work.
  2. Consider adding a table summarizing per-model, per-metric results with exact counts to improve clarity and reproducibility.
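
A table of the kind suggested in the second minor comment could be generated from raw pass/fail records along the following lines. This is a sketch with invented records; the record layout `(model, metric, passed)` and the function names `tabulate_counts` and `render_table` are assumptions, not the paper's data format.

```python
from collections import defaultdict


def tabulate_counts(records):
    """Map (model, metric) -> [passes, total] from (model, metric, passed) rows."""
    counts = defaultdict(lambda: [0, 0])
    for model, metric, passed in records:
        cell = counts[(model, metric)]
        cell[0] += int(passed)
        cell[1] += 1
    return dict(counts)


def render_table(counts):
    """Render counts as plain-text rows: model, metric, passes/total, rate."""
    lines = [f"{'model':<16}{'metric':<24}{'pass/total':<12}rate"]
    for (model, metric), (passes, total) in sorted(counts.items()):
        ratio = f"{passes}/{total}"
        lines.append(f"{model:<16}{metric:<24}{ratio:<12}{passes / total:.0%}")
    return "\n".join(lines)


# Invented records for illustration only; these are not results from the paper.
records = [
    ("model_a", "Clarity", True),
    ("model_a", "Clarity", False),
    ("model_a", "Contextual Relevance", True),
    ("model_b", "Clarity", True),
]
counts = tabulate_counts(records)
table = render_table(counts)
```

Reporting the raw `passes/total` pair next to each rate lets a reader recompute any percentage and judge how much data sits behind it.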

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their thoughtful and constructive feedback. We address each major comment below and outline revisions to improve clarity and transparency while preserving the core contributions of the work.

point-by-point responses
  1. Referee: Abstract: The reported aggregate reliability of approximately 97% for frontier models lacks accompanying sample sizes, number of evaluated interactions, inter-rater agreement details, statistical tests, or baseline comparisons, undermining the ability to assess the reliability of the central findings on curriculum-alignment gaps.

    Authors: We agree that the abstract should be more self-contained. In the revised manuscript we will incorporate the total number of evaluated interactions, inter-rater agreement statistics, and a brief description of how the aggregate reliability figure is computed, drawing directly from the methods section. Because this is a novel curriculum-specific benchmark, external baselines are not available; we will note this limitation explicitly. revision: yes

  2. Referee: Evaluation Framework (and Abstract): The seven binary metrics and the curriculum-aligned benchmark are presented without evidence of validation by Nepalese teachers, alignment with local pedagogical standards, or correlation with actual student learning gains, which is critical since the claim that LLMs are not ready for deployment depends on these metrics accurately capturing pedagogical effectiveness.

    Authors: The metrics were constructed by direct reference to Nepal’s official Grade 5–10 Science and Mathematics curriculum documents and standard pedagogical criteria in the education literature. We will expand the methods section with an explicit mapping of each metric to curriculum objectives and relevant pedagogical sources. A full teacher-validation study and correlation with student learning outcomes would require a separate, resource-intensive field experiment that lies outside the scope of the present paper; we will therefore revise the abstract and discussion to frame the results as evidence of curriculum-alignment and basic pedagogical shortcomings rather than a definitive demonstration of deployment unreadiness. revision: partial

  3. Referee: Results section: The identification of failure modes like 'Expert's Curse' and 'Foundational Fallacy' is based on the unvalidated metrics; without external validation or student outcome data, these may reflect artifacts of the evaluation framework rather than genuine pedagogical shortcomings.

    Authors: The failure modes were derived from systematic qualitative review of the specific interactions that failed the Clarity and Contextual Relevance metrics. We will add additional annotated examples and inter-annotator notes in the appendix to make the derivation process transparent. We accept that these patterns would be strengthened by external validation and will add a dedicated limitations paragraph acknowledging this point and outlining plans for future classroom studies. revision: partial
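
The inter-rater agreement statistics promised in the first response could be reported per metric with Cohen's kappa. The sketch below is a minimal version for two annotators giving binary labels; the labels are invented, and the source does not say which agreement statistic the authors use.

```python
def cohens_kappa(rater_a: list[bool], rater_b: list[bool]) -> float:
    """Cohen's kappa for two raters assigning binary labels to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    p_a = sum(rater_a) / n  # share of items rater A marked as pass
    p_b = sum(rater_b) / n  # share of items rater B marked as pass
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    if expected == 1.0:
        return 1.0  # both raters gave the same single label to every item
    return (observed - expected) / (1 - expected)


# Invented Clarity labels from two hypothetical annotators.
annotator_1 = [True, True, False, True, False, True, True, False]
annotator_2 = [True, True, False, False, False, True, True, True]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for chance, which matters for metrics where nearly every response passes and two annotators would agree most of the time even if they labelled at random.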

standing simulated objections not resolved
  • Direct empirical correlation between the proposed metrics and measurable student learning gains in Nepalese classrooms, which would require a controlled longitudinal intervention study beyond the resources and scope of the current project.

Circularity Check

0 steps flagged

No circularity: direct empirical evaluation against externally defined benchmark

full rationale

The paper performs an empirical case study by applying four LLMs to a curriculum-aligned benchmark for Nepal's Grade 5-10 Science and Mathematics. It decomposes performance into seven explicitly defined binary metrics (Prompt Alignment, Factual Correctness, etc.) and reports observed failure modes. No equations, fitted parameters, or self-citations are used to derive results; the central claim follows directly from the measured outcomes on the stated metrics. The benchmark and metrics are introduced as novel but are not shown to reduce to prior author work by construction. This is a standard non-circular empirical assessment.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the unvalidated assumption that the seven binary metrics fully measure pedagogical readiness and that the observed failure modes generalize beyond the tested prompts and models.

axioms (1)
  • domain assumption: The seven binary metrics (Prompt Alignment, Factual Correctness, Clarity, Contextual Relevance, Engagement, Harmful Content Avoidance, Solution Accuracy) sufficiently measure pedagogical readiness for Nepalese students.
    The framework is presented as inspired by natural language unit tests, but no justification or external validation is given for why these exact binary metrics capture teaching quality.
invented entities (2)
  • Expert's Curse (no independent evidence)
    purpose: Label for the observed pattern where models solve complex problems but fail to explain them clearly to novices.
    Named phenomenon derived from evaluation results with no independent evidence outside the paper.
  • Foundational Fallacy (no independent evidence)
    purpose: Label for the observed pattern where performance degrades on simpler lower-grade material.
    Named phenomenon derived from evaluation results with no independent evidence outside the paper.

pith-pipeline@v0.9.0 · 5666 in / 1434 out tokens · 60798 ms · 2026-05-15T10:33:11.739346+00:00 · methodology

