pith. machine review for the scientific record.

arxiv: 2604.20996 · v1 · submitted 2026-04-22 · 💻 cs.CL

Recognition: unknown

AFRILANGTUTOR: Advancing Language Tutoring and Culture Education in Low-Resource Languages with Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:32 UTC · model grok-4.3

classification 💻 cs.CL
keywords low-resource languages · African languages · language tutoring · large language models · supervised fine-tuning · direct preference optimization · educational AI · multilingual models

The pith

Fine-tuning LLMs on dictionary-derived African tutoring data produces consistent gains over base models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles the lack of resources for building AI language tutors in African languages by first assembling a large dictionary of English translations and then using it to automatically create thousands of multi-turn student-tutor dialogues. These dialogues form a training set that is used to adapt two existing multilingual models through supervised fine-tuning followed by preference optimization. The adapted models are then shown to receive higher scores than the original versions when another LLM judges their tutoring responses across four criteria. A reader would care because the method supplies a concrete starting point for educational AI in languages that otherwise have almost no suitable training material.

Core claim

We construct AFRILANGDICT containing 194.7K African language-English dictionary entries, then use it to generate AFRILANGEDU, a set of 78.9K multi-turn question-answer examples. Fine-tuning Llama-3-8B-IT and Gemma-3-12B-IT on AFRILANGEDU for ten African languages yields AFRILANGTUTOR models that outperform their base counterparts, with the combination of supervised fine-tuning and direct preference optimization delivering improvements between 1.8% and 15.5% under LLM-as-a-judge evaluation on four tutoring criteria.
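The SFT+DPO recipe in this claim hinges on the DPO objective, which needs no separate reward model: each preference pair is scored by how much the trained policy shifts probability toward the chosen response relative to a frozen reference model. A minimal sketch in plain Python (log-probabilities as floats; β = 0.1 matches the default configuration noted in the paper's appendix, but this is an illustration, not the authors' training code):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, from log-probabilities of the
    chosen/rejected responses under the trained policy (pi_*) and the
    frozen reference model (ref_*): -log sigmoid(beta * margin)."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# At initialization the policy equals the reference, so the margin is 0
# and the loss starts at log(2) ~= 0.693 for every pair.
print(round(dpo_loss(-3.0, -4.0, -3.0, -4.0), 3))  # 0.693
```

The loss falls only when the policy raises the chosen response's log-probability relative to the rejected one faster than the reference does, which is what lets dictionary-seeded preference pairs steer tutoring style.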

What carries the argument

AFRILANGEDU, the collection of automatically generated multi-turn student-tutor interactions derived from dictionary entries and used for supervised fine-tuning followed by direct preference optimization.
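The generation step can be pictured as turning each dictionary entry into a seeding prompt for a stronger LLM (the paper uses Gemini-2.5-Pro for the chosen responses). A toy sketch; the field names `word`, `language`, and `english_gloss` are illustrative placeholders, not the actual AFRILANGDICT schema:

```python
def build_tutor_prompt(entry, dialog_type="Direct Q&A", turns=3):
    """Format one AFRILANGDICT-style entry into a dialogue-generation
    prompt. Entry fields are illustrative, not the paper's schema."""
    return (
        f"You are a {entry['language']} language tutor.\n"
        f"Seed entry: '{entry['word']}' translates to English as "
        f"'{entry['english_gloss']}'.\n"
        f"Write a {turns}-turn student-tutor dialogue of type "
        f"'{dialog_type}'. Stay faithful to the dictionary meaning."
    )

prompt = build_tutor_prompt(
    {"word": "habari", "language": "Swahili", "english_gloss": "news; greeting"}
)
```

Sweeping such a template over 194.7K entries and a catalogue of dialog types is one plausible way the 78.9K multi-turn examples could be produced at scale.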

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dictionary-to-interaction pipeline could be reused for other low-resource languages outside the ten studied here.
  • Public release of the datasets lowers the barrier for independent groups to test alternative training methods or languages.
  • If the generated dialogues prove sufficiently realistic, they could serve as seed material for human-curated expansions rather than purely automatic ones.
  • Wider deployment might eventually support preservation of cultural knowledge carried in the languages themselves.

Load-bearing premise

That scores assigned by an LLM judge reliably indicate real tutoring quality for these languages.

What would settle it

A controlled study in which native speakers rate the same model outputs on the same four criteria and the human rankings are compared directly to the LLM-judge rankings.
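Such a comparison reduces to correlating two rankings of the same outputs. One simple statistic for that is Spearman's rho (the paper itself reports weighted Cohen's kappa for human-LLM agreement, so this is an illustration of the comparison, not its protocol); a self-contained sketch for tie-free score lists:

```python
def spearman_rho(human, judge):
    """Spearman rank correlation between two paired score lists.
    Assumes no ties; with ties, rank averaging would be needed."""
    n = len(human)
    rank = lambda xs: {x: i for i, x in enumerate(sorted(xs), start=1)}
    rh, rj = rank(human), rank(judge)
    d2 = sum((rh[h] - rj[j]) ** 2 for h, j in zip(human, judge))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Identical rankings give +1; fully reversed rankings give -1.
print(spearman_rho([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # 1.0
```

A rho near 1 between native-speaker and LLM-judge rankings would support the load-bearing premise; a low or negative rho would undercut it.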

Figures

Figures reproduced from arXiv: 2604.20996 by Anshuman Chhabra, Israel Abebe Azime, Ocean Monjur, Seid Muhie Yimam, Shahriar Kabir Nahin, Shamsuddeen Hassan Muhammad, Tadesse Destaw Belay.

Figure 1
Figure 1: Number of documents for 10 African LRLs in two widely used pretraining corpora: MADLAD-400 (left) and FineWeb2 (right), compared with high-resource English (1.8B) and Russian (699M).
Figure 2
Figure 2: Overview of the AFRILANGTUTOR pipeline. Dictionary sources are collected and processed via OCR and human verification to construct AFRILANGDICT across 10 languages. These entries serve as seed data for synthetic generation of AFRILANGEDU, which comprises multi-turn tutoring dialogues and DPO preference pairs. Finally, Llama-3-8B and Gemma-3-12B are fine-tuned using SFT, DPO, and SFT+DPO to produce the AFRILANGTUTOR models.
Figure 3
Figure 3: Data format and examples: (a) AFRILANGDICT dictionary format, (b) DPO data, and (c) multi-turn dialog with 3 full turns. Both (b) and (c) comprise AFRILANGEDU and are generated using AFRILANGDICT. The multi-turn responses and the chosen answer for DPO are generated using the highly performant Gemini-2.5-Pro (Comanici et al., 2025), and the rejected answer for DPO is generated using various lower LRL-quality …
Figure 4
Figure 4: Performance of our AFRILANGTUTOR LLMs (Llama-3-8B-IT and Gemma-3-12B-IT post SFT+DPO fine-tuning) across different question types in our AFRILANGEDU benchmark test set.
Figure 5
Figure 5: The average agreement between humans and the LLM lies between 0.61–0.80, indicating substantial agreement based on the interpretation scales for Cohen's kappa (Landis and Koch, 1977). [Panels: Instructional Alignment, Pedagogical Compliance, Linguistic and Cultural Accuracy, Coherence and Naturalness; y-axis: weighted Cohen's kappa.]
Figure 6
Figure 6: Average influence score distribution of the …
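The agreement statistic behind Figure 5 can be sketched directly. A linear-weighted Cohen's kappa for two raters over ordinal labels 0..k-1 (the paper may use a different weighting, e.g. quadratic; this is an illustration, not its exact protocol):

```python
def weighted_kappa(a, b, k):
    """Linear-weighted Cohen's kappa for two raters over labels 0..k-1.
    1 = perfect agreement, 0 = chance level, negative = below chance."""
    n = len(a)
    obs = [[0.0] * k for _ in range(k)]          # observed joint distribution
    for x, y in zip(a, b):
        obs[x][y] += 1.0 / n
    pa = [sum(row) for row in obs]               # rater A marginals
    pb = [sum(obs[i][j] for i in range(k)) for j in range(k)]  # rater B marginals
    w = lambda i, j: abs(i - j) / (k - 1)        # linear disagreement weight
    d_obs = sum(w(i, j) * obs[i][j] for i in range(k) for j in range(k))
    d_exp = sum(w(i, j) * pa[i] * pb[j] for i in range(k) for j in range(k))
    return 1.0 - d_obs / d_exp
```

On the Landis and Koch (1977) scale referenced in the caption, values of 0.61–0.80 computed this way fall in the "substantial agreement" band.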
read the original abstract

How can language learning systems be developed for languages that lack sufficient training resources? This challenge is increasingly faced by developers across the African continent who aim to build AI systems capable of understanding and responding in local languages. To address this gap, we introduce AFRILANGDICT, a collection of 194.7K African language-English dictionary entries designed as seed resources for generating language-learning materials, enabling us to automatically construct large-scale, diverse, and verifiable student-tutor question-answer interactions suitable for training AI-assisted language tutors. Using AFRILANGDICT, we build AFRILANGEDU, a dataset of 78.9K multi-turn training examples for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Using AFRILANGEDU, we train language tutoring models collectively referred to as AFRILANGTUTOR. We fine-tune two multilingual LLMs: Llama-3-8B-IT and Gemma-3-12B-IT on AFRILANGEDU across 10 African languages and evaluate their performance. Our results show that models trained on AFRILANGEDU consistently outperform their base counterparts, and combining SFT and DPO yields substantial improvements, with gains ranging from 1.8% to 15.5% under LLM-as-a-judge evaluations across four criteria. To facilitate further research on low-resource languages -- all resources are available at https://huggingface.co/afrilang-edu.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents AFRILANGDICT (194.7K African language-English dictionary entries) as seed data to automatically generate AFRILANGEDU (78.9K multi-turn student-tutor QA examples). These are used to fine-tune Llama-3-8B-IT and Gemma-3-12B-IT via SFT and DPO for 10 African languages, producing AFRILANGTUTOR models. The central claim is that the fine-tuned models consistently outperform their base counterparts, with SFT+DPO yielding gains of 1.8–15.5% on LLM-as-a-judge scores across four criteria; all resources are released on Hugging Face.

Significance. If the evaluation holds, the work supplies concrete, open datasets and models that could accelerate AI-assisted tutoring for under-resourced African languages, a domain with clear practical need. The public release of AFRILANGDICT and AFRILANGEDU is a clear strength that supports reproducibility and follow-on research.

major comments (1)
  1. [Abstract and Evaluation] The headline result (consistent outperformance and 1.8–15.5% gains from SFT+DPO) rests entirely on LLM-as-a-judge ratings. No human correlation study, native-speaker validation of the 78.9K generated pairs, inter-annotator agreement, or error bars are reported. For low-resource languages this leaves open the possibility that the judge over-rates fluent but culturally shallow or pedagogically weak responses, directly undermining the claim that AFRILANGTUTOR improves tutoring quality.
minor comments (2)
  1. [Methods] The manuscript supplies no details on the prompts, quality filters, or verification steps used to construct the 78.9K AFRILANGEDU examples from AFRILANGDICT.
  2. [Results] The reported percentage gains lack statistical significance tests or confidence intervals, making it difficult to assess whether the observed differences are reliable.
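The missing uncertainty estimates could be supplied cheaply with a paired percentile bootstrap over test items. A sketch with made-up judge scores (the numbers are illustrative, not from the paper):

```python
import random

def bootstrap_gain_ci(base, tuned, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-item judge-score gain
    (tuned - base). Scores must be paired on the same test items."""
    rng = random.Random(seed)
    gains = [t - b for b, t in zip(base, tuned)]
    means = sorted(
        sum(rng.choice(gains) for _ in gains) / len(gains)
        for _ in range(n_boot)
    )
    return means[int(alpha / 2 * n_boot)], means[int((1 - alpha / 2) * n_boot) - 1]

# Illustrative scores only -- every per-item gain is positive here, so
# the resulting 95% interval should exclude zero.
base = [3.0, 3.1, 2.9, 3.2, 3.0, 2.8, 3.1, 3.0]
tuned = [3.5, 3.7, 3.3, 3.8, 3.4, 3.3, 3.6, 3.5]
lo, hi = bootstrap_gain_ci(base, tuned)
```

An interval whose lower bound stays above zero would address the referee's reliability concern without any new data collection.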

Simulated Author's Rebuttal

1 responses · 1 unresolved

We thank the referee for their constructive feedback on our work. We agree that evaluation robustness is essential for claims about tutoring quality in low-resource settings and address the specific concern below, outlining revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract and Evaluation] The headline result (consistent outperformance and 1.8–15.5% gains from SFT+DPO) rests entirely on LLM-as-a-judge ratings. No human correlation study, native-speaker validation of the 78.9K generated pairs, inter-annotator agreement, or error bars are reported. For low-resource languages this leaves open the possibility that the judge over-rates fluent but culturally shallow or pedagogically weak responses, directly undermining the claim that AFRILANGTUTOR improves tutoring quality.

    Authors: We acknowledge that this is a valid and important limitation. Our primary evaluation does rely on LLM-as-a-judge scores without a human correlation study, native-speaker validation of the generated AFRILANGEDU pairs, inter-annotator agreement metrics, or reported error bars. For the ten low-resource African languages involved, large-scale human annotation by qualified native speakers is logistically difficult and resource-intensive, which is why we adopted the LLM-as-judge protocol following common practice in recent LLM evaluation literature. We will revise the manuscript to (1) expand the Limitations section with an explicit discussion of the risks that an LLM judge may over-rate fluent but shallow or culturally imprecise responses, (2) add error bars or standard deviations where multiple evaluation runs are feasible, and (3) more prominently highlight that the public release of AFRILANGDICT and AFRILANGEDU is intended to support exactly the kind of human validation studies the referee recommends. We do not claim the current results constitute definitive proof of superior tutoring quality; rather, they show that models trained on our datasets outperform their base versions under the chosen metric. This does not invalidate the core contribution of the datasets and models, but we accept that stronger human evidence would strengthen the claims. revision: partial

standing simulated objections not resolved
  • Providing a human correlation study, native-speaker validation of the 78.9K pairs, or inter-annotator agreement results, as these were not collected in the original work and would require new data collection beyond the scope of a standard revision.

Circularity Check

0 steps flagged

No significant circularity; performance claims rest on external evaluation

full rationale

The paper's central chain constructs AFRILANGDICT as seed data, derives AFRILANGEDU question-answer pairs from it, performs SFT+DPO fine-tuning on base models (Llama-3-8B-IT, Gemma-3-12B-IT), and reports relative gains via a separate LLM-as-a-judge protocol on four criteria. None of these steps reduces to a self-definition, a fitted parameter renamed as prediction, or a self-citation chain; the judge scores are generated post-training by an independent model and are not algebraically or statistically forced by the training objective. The derivation therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on the assumption that bilingual dictionary entries can be automatically expanded into high-quality multi-turn tutoring dialogues without introducing systematic errors, and that LLM judges can serve as a proxy for human tutoring quality. No new entities are postulated.

axioms (2)
  • domain assumption Bilingual dictionary entries contain sufficient semantic information to generate pedagogically useful student-tutor interactions.
    Invoked when the authors state that AFRILANGDICT enables automatic construction of large-scale, diverse, and verifiable QA pairs.
  • domain assumption LLM-as-a-judge scores correlate with actual tutoring effectiveness for low-resource languages.
    Used to interpret the 1.8–15.5% gains as meaningful improvements.

pith-pipeline@v0.9.0 · 5605 in / 1523 out tokens · 25176 ms · 2026-05-10T00:32:23.657862+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

25 extracted references · 6 canonical work pages · 5 internal anchors

  1. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
  2. Retrieval-Augmented Generation for Large Language Models: A Survey
  3. Beyond Single-Turn: A Survey on Multi-Turn Interactions with Large Language Models
  4. Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan
  5. Gemma 2: Improving Open Language Models at a Practical Size
  6. Qwen3 Technical Report
  7. LexC-gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons. Findings of EMNLP 2024, pages 13990–14009, Miami, Florida, USA
  8. Teaching Large Language Models an Unseen Language on the Fly. Findings of ACL 2024, pages 8783–8800, Bangkok, Thailand
  9. Direct Q&A: simple student-tutor explanation of a word, phrase, or sentence meaning
  10. Quiz (Multiple Choice): the language learner and the tutor interact in a question-and-answer conversation
  11. Fill-in-the-Blank: a contextual sentence with a missing word
  12. Role-play / Contextual Use: greeting, school, or conversation simulation
  13. Error Correction / Hinting: the student asks, and the tutor corrects the student's misunderstanding
  14. Sentence Building: the student asks the tutor to build a sentence, and the tutor creates a complete sentence using the word
  15. Translation Practice: forward and backward translation check
  16. Spelling & Pronunciation: language transliteration or phonetic spelling practice and spelling correction
  17. Cultural Note Integration: explanation of cultural or contextual relevance
  18. Grammar Explanation: the student asks about a grammar rule involving the target word, and the tutor provides a clear and simple explanation with examples
  19. Misspelled / Typo: the student attempts to ask about the target word but makes a significant spelling error (e.g., swapping letters, omitting vowels, or using phonetic spelling)
  20. Vague / Ambiguous: the student provides insufficient context or is unclear about their intent (e.g., typing only the word or asking "What about this?" without specifying the target)
  21. Irrelevant / Mixed Context: the student mixes the language-learning question with an unrelated topic (e.g., Python code, weather prediction, or general knowledge)
  22. Factually Wrong Premise: the student asks a question based on a confidently stated false assumption (e.g., "Since [WORD] means [WRONG_MEANING], can I use it to describe a river?")
  23. Out-of-Scope / Nonsensical: the student asks for inappropriate usage (e.g., how to use the [WORD] as an insult) or poses impossible questions about abstract words (e.g., "What color is this verb?")
  24. Influence analysis: TraceIn (Pruthi et al., 2020) is used as the influence function for calculating influence scores between training and validation samples
  25. DPO configurations: the default (β = 0.1, batch size 1) and the custom (β = 0.5, batch size 4, 10 epochs) settings from the LlamaFactory fine-tuning framework, with both full fine-tuning and LoRA; additional results across dialog types (Table 8) and automatic evaluation metrics (Table 9)
    and the custom (β= 0.5 , batch size 4, epoch 10) from LlamaFactory5 LLM fine-tuning framework. We also make a full fine-tuning and fine-tuning with LORA settings. H Additional Results across Dialog Types Table 8 shows results across dialog types. I Additional Automatic Evaluation Metric Results Table 9 shows results from automatic evaluation metrics (BERT...