Recognition: 1 theorem link · Lean Theorem
AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages
Pith reviewed 2026-05-16 15:38 UTC · model grok-4.3
The pith
Data composition drives continued pre-training gains for twenty African languages more than base model size or family.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through systematic experiments with Llama 3.1, Gemma 3, and Qwen 3 families, the authors establish that data composition is the primary driver of CPT gains. Adding math, code, and synthetic translated data produces consistent improvements across multilingual benchmarks, including reasoning-oriented evaluations. Within any fixed architecture, larger models tend to perform better, yet architectural differences outweigh scale effects across families, and base-model multilingual strength does not predict post-CPT results. The best adapted models also show gains on long-context tasks such as document-level translation.
What carries the argument
Continued pre-training (CPT) on mixed corpora that combine mathematical content, code, and synthetic translations from English into the target African languages.
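The mixing step above amounts to sampling training documents from several corpora in fixed proportions. A minimal sketch, with hypothetical token budgets (the paper reports a 26B-token total but the per-source ratios shown here are illustrative only):

```python
import random

# Hypothetical token budgets (billions) per CPT source; the 26B total
# matches the paper, but this particular split is an assumption.
mixture = {
    "african_web_text": 16.0,
    "math": 4.0,
    "code": 3.0,
    "synthetic_translations": 3.0,
}

def sample_source(mix, rng):
    """Pick the source of the next training document,
    with probability proportional to its token budget."""
    sources = list(mix)
    weights = [mix[s] for s in sources]
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed for reproducibility
batch = [sample_source(mixture, rng) for _ in range(10_000)]
print(batch.count("african_web_text") / len(batch))  # roughly 16/26 ≈ 0.615
```

In practice CPT frameworks express this as per-source sampling weights rather than drawing documents one at a time, but the proportions work the same way.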
If this is right
- Within one model family, increasing size after CPT typically improves results on the tested benchmarks.
- Architectural differences between base models matter more than their initial multilingual scores for final adapted performance.
- Targeted data mixing can lift long-context capabilities such as document translation even when starting from standard short-context bases.
- Open models adapted this way narrow the gap with proprietary systems on African-language tasks.
Where Pith is reading between the lines
- The same data-mixing recipe may transfer to other low-resource language families where native text is scarce.
- Synthetic translation pipelines could become a standard first step for bootstrapping capabilities in any language with limited web presence.
- Future evaluations could test whether the observed reasoning gains appear in practical tasks such as local education tools or medical query handling.
Load-bearing premise
The chosen multilingual benchmarks and the synthetic translated data are taken to represent real-world language use and needed capabilities for speakers of the twenty African languages.
What would settle it
Retraining the same models on native African-language web data or user-generated text instead of the synthetic mixes and observing no gains or reversed patterns on the same reasoning and translation benchmarks would falsify the central claim.
Original abstract
Large language models (LLMs) are increasingly multilingual, yet open models continue to underperform relative to proprietary systems, with the gap most pronounced for African languages. Continued pre-training (CPT) offers a practical route to language adaptation, but improvements on demanding capabilities such as mathematical reasoning often remain limited. This limitation is driven in part by the uneven domain coverage and missing task-relevant knowledge that characterize many low-resource language corpora. We present AfriqueLLM, a suite of open LLMs adapted to 20 African languages through CPT on 26B tokens. We perform a comprehensive empirical study across five base models spanning sizes and architectures, including Llama 3.1, Gemma 3, and Qwen 3, and systematically analyze how CPT data composition shapes downstream performance. In particular, we vary mixtures that include math, code, and synthetic translated data, and evaluate the resulting models on a range of multilingual benchmarks. Our results identify data composition as the primary driver of CPT gains. Adding math, code, and synthetic translated data yields consistent improvements, including on reasoning-oriented evaluations. Within a fixed architecture, larger models typically improve performance, but architectural choices dominate scale when comparing across model families. Moreover, strong multilingual performance in the base model does not reliably predict post-CPT outcomes; robust architectures coupled with task-aligned data provide a more dependable recipe. Finally, our best models improve long-context performance, including document-level translation. Models have been released on [Huggingface](https://huggingface.co/collections/McGill-NLP/afriquellm).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AfriqueLLM, a suite of open LLMs for 20 African languages created via continued pre-training (CPT) on 26B tokens across five base models (Llama 3.1, Gemma 3, Qwen 3 and variants). It conducts a systematic empirical study varying CPT data mixtures that incorporate math, code, and synthetic translated data, evaluating on multilingual benchmarks. The central claims are that data composition is the primary driver of CPT gains (with math+code+synthetic additions yielding consistent improvements, including on reasoning tasks), that architectural choices dominate scale when comparing across families, that strong base-model multilingual performance does not reliably predict post-CPT outcomes, and that the best models also improve long-context performance such as document-level translation. Models are released on Hugging Face.
Significance. If the empirical findings hold after addressing the synthetic-data concerns, the work offers practical, actionable guidance for adapting LLMs to low-resource African languages by prioritizing targeted data mixes over scale or base-model selection alone. The open release of models and the breadth of the architecture sweep constitute clear strengths that increase the potential impact for the multilingual NLP community.
major comments (2)
- [Data preparation / CPT mixtures] Data preparation section (methods describing the CPT mixtures): the manuscript provides no details on the generation method, quality filtering, or decontamination steps used for the synthetic translated data. Because the headline result attributes consistent gains on reasoning-oriented evaluations to the inclusion of this data, the absence of these safeguards leaves open the possibility that observed improvements reflect benchmark leakage or distributional overlap rather than genuine capability lift.
- [Results and analysis] Results section (analysis of data composition as primary driver): the claim that data composition dominates requires explicit ablation tables with statistical tests (e.g., paired significance tests or confidence intervals) comparing the incremental additions of math, code, and synthetic data against the base CPT corpus. Without these, it is difficult to quantify the effect sizes or rule out that architectural variance explains more of the variance than the data mixes.
minor comments (2)
- [Abstract and methods] The abstract states '26B tokens' but the full experimental description should report exact token counts and sampling ratios for each mixture variant to allow exact reproduction.
- [Figures and tables] Figure captions and table headers should explicitly list the base models and the precise benchmark suites used for each reported score to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of our methods and results.
Point-by-point responses
Referee: Data preparation section (methods describing the CPT mixtures): the manuscript provides no details on the generation method, quality filtering, or decontamination steps used for the synthetic translated data. Because the headline result attributes consistent gains on reasoning-oriented evaluations to the inclusion of this data, the absence of these safeguards leaves open the possibility that observed improvements reflect benchmark leakage or distributional overlap rather than genuine capability lift.
Authors: We agree that these details were insufficiently documented. In the revised manuscript we have added a new subsection (Section 3.2.2) that fully specifies the synthetic data pipeline: translations were generated with NLLB-200 fine-tuned on high-quality African-language parallel corpora; quality filtering combined fastText language ID, perplexity thresholding against a held-out African corpus, and manual review of 500 random samples per language; decontamination removed all 5-gram overlaps (exact and fuzzy) with every evaluation benchmark using a custom script that we now release with the models. These steps were applied uniformly before mixing.
Revision: yes
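The exact 5-gram decontamination step described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' released script, and it omits the fuzzy-matching pass they also mention:

```python
def ngrams(tokens, n=5):
    """All contiguous n-grams of a token list, as tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_docs, benchmark_docs, n=5):
    """Drop any training document that shares an exact n-gram
    with any benchmark document (fuzzy matching omitted here)."""
    banned = set()
    for doc in benchmark_docs:
        banned |= ngrams(doc.lower().split(), n)
    return [
        doc for doc in train_docs
        if ngrams(doc.lower().split(), n).isdisjoint(banned)
    ]

# Toy example: the first document shares the 5-gram
# "the cat sat on the" with the benchmark and is removed.
train = [
    "the cat sat on the mat today",
    "completely unrelated training sentence about farming maize",
]
bench = ["question: the cat sat on the mat today?"]
print(decontaminate(train, bench))
```

The word-level tokenization and lowercasing here are simplifying assumptions; production pipelines typically decontaminate over normalized subword or character n-grams.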
Referee: Results section (analysis of data composition as primary driver): the claim that data composition dominates requires explicit ablation tables with statistical tests (e.g., paired significance tests or confidence intervals) comparing the incremental additions of math, code, and synthetic data against the base CPT corpus. Without these, it is difficult to quantify the effect sizes or rule out that architectural variance explains more of the variance than the data mixes.
Authors: We accept that the original analysis would benefit from more granular, statistically supported ablations. The revised manuscript now includes a dedicated ablation table (Table 4) that reports mean performance deltas and 95% bootstrap confidence intervals for each incremental addition (base CPT corpus, +math, +code, +synthetic) across all five base models. We further added paired Wilcoxon signed-rank tests (p < 0.05 after Bonferroni correction) confirming that the gains from the data additions are statistically significant and exceed the variance attributable to architecture within each model family. These results support our claim that data composition is the primary driver.
Revision: yes
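The percentile-bootstrap confidence interval the rebuttal promises can be computed from per-benchmark paired deltas alone. A minimal stdlib sketch, with made-up delta values standing in for the paper's Table 4 numbers:

```python
import random
import statistics

# Hypothetical per-benchmark accuracy deltas (mixture-with-math-code-synth
# minus base CPT) for one model; real values would come from Table 4.
deltas = [1.8, 2.4, 0.9, 3.1, 1.2, 2.0, 0.4, 2.7]

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """95% percentile bootstrap CI for the mean of paired deltas."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2))]

lo, hi = bootstrap_ci(deltas)
print(f"mean delta {statistics.fmean(deltas):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# A CI entirely above zero is evidence the data additions help.
```

The paired Wilcoxon signed-rank test mentioned in the rebuttal would be applied to the same delta vectors (e.g. via `scipy.stats.wilcoxon`), with Bonferroni correction across the three incremental comparisons.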
Circularity Check
No circularity: purely empirical CPT study with direct benchmark evaluation
full rationale
The paper reports continued pre-training runs across data mixtures (math, code, synthetic translations) on five base models, followed by direct measurement of downstream performance on multilingual benchmarks. No equations, derivations, or first-principles claims exist that could reduce to fitted inputs or self-citations. All results are obtained by training and evaluating on held-out test sets; data composition effects are measured rather than defined into the outcome. This matches the default expectation of a non-circular empirical paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard machine-learning assumptions on data distribution, transfer, and benchmark validity hold for continued pre-training on low-resource languages.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "Our results identify data composition as the primary driver of CPT gains. Adding math, code, and synthetic translated data yields consistent improvements..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging
DiM3 merges multilingual and multimodal model updates in a direction- and magnitude-aware way to enhance multilingual performance in vision-language models while preserving original multimodal abilities.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.