Recognition: 1 theorem link · Lean Theorem
AfriqueLLM: How Data Mixing and Model Architecture Impact Continued Pre-training for African Languages
Pith reviewed 2026-05-16 15:38 UTC · model grok-4.3
The pith
Data composition drives continued pre-training gains for twenty African languages more than base model size or family.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Through systematic experiments with Llama 3.1, Gemma 3, and Qwen 3 families, the authors establish that data composition is the primary driver of CPT gains. Adding math, code, and synthetic translated data produces consistent improvements across multilingual benchmarks, including reasoning-oriented evaluations. Within any fixed architecture, larger models tend to perform better, yet architectural differences outweigh scale effects across families, and base-model multilingual strength does not predict post-CPT results. The best adapted models also show gains on long-context tasks such as document-level translation.
What carries the argument
Continued pre-training (CPT) on mixed corpora that combine mathematical content, code, and synthetic translations from English into the target African languages.
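The mixing step above amounts to sampling training documents from several corpora in fixed proportions. A minimal sketch, with hypothetical token budgets (the paper reports a 26B-token total but the per-source ratios shown here are illustrative only):

```python
import random

# Hypothetical token budgets (billions) per CPT source; the 26B total
# matches the paper, but this particular split is an assumption.
mixture = {
    "african_web_text": 16.0,
    "math": 4.0,
    "code": 3.0,
    "synthetic_translations": 3.0,
}

def sample_source(mix, rng):
    """Pick the source of the next training document,
    with probability proportional to its token budget."""
    sources = list(mix)
    weights = [mix[s] for s in sources]
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed for reproducibility
batch = [sample_source(mixture, rng) for _ in range(10_000)]
print(batch.count("african_web_text") / len(batch))  # roughly 16/26 ≈ 0.615
```

In practice CPT frameworks express this as per-source sampling weights rather than drawing documents one at a time, but the proportions work the same way.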
If this is right
- Within one model family, increasing size after CPT typically improves results on the tested benchmarks.
- Architectural differences between base models matter more than their initial multilingual scores for final adapted performance.
- Targeted data mixing can lift long-context capabilities such as document translation even when starting from standard short-context bases.
- Open models adapted this way narrow the gap with proprietary systems on African-language tasks.
Where Pith is reading between the lines
- The same data-mixing recipe may transfer to other low-resource language families where native text is scarce.
- Synthetic translation pipelines could become a standard first step for bootstrapping capabilities in any language with limited web presence.
- Future evaluations could test whether the observed reasoning gains appear in practical tasks such as local education tools or medical query handling.
Load-bearing premise
The chosen multilingual benchmarks and the synthetic translated data are taken to represent real-world language use and needed capabilities for speakers of the twenty African languages.
What would settle it
Retraining the same models on native African-language web data or user-generated text instead of the synthetic mixes and observing no gains or reversed patterns on the same reasoning and translation benchmarks would falsify the central claim.
Original abstract
Large language models (LLMs) are increasingly multilingual, yet open models continue to underperform relative to proprietary systems, with the gap most pronounced for African languages. Continued pre-training (CPT) offers a practical route to language adaptation, but improvements on demanding capabilities such as mathematical reasoning often remain limited. This limitation is driven in part by the uneven domain coverage and missing task-relevant knowledge that characterize many low-resource language corpora. We present AfriqueLLM, a suite of open LLMs adapted to 20 African languages through CPT on 26B tokens. We perform a comprehensive empirical study across five base models spanning sizes and architectures, including Llama 3.1, Gemma 3, and Qwen 3, and systematically analyze how CPT data composition shapes downstream performance. In particular, we vary mixtures that include math, code, and synthetic translated data, and evaluate the resulting models on a range of multilingual benchmarks. Our results identify data composition as the primary driver of CPT gains. Adding math, code, and synthetic translated data yields consistent improvements, including on reasoning-oriented evaluations. Within a fixed architecture, larger models typically improve performance, but architectural choices dominate scale when comparing across model families. Moreover, strong multilingual performance in the base model does not reliably predict post-CPT outcomes; robust architectures coupled with task-aligned data provide a more dependable recipe. Finally, our best models improve long-context performance, including document-level translation. Models have been released on [Huggingface](https://huggingface.co/collections/McGill-NLP/afriquellm).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AfriqueLLM, a suite of open LLMs for 20 African languages created via continued pre-training (CPT) on 26B tokens across five base models (Llama 3.1, Gemma 3, Qwen 3 and variants). It conducts a systematic empirical study varying CPT data mixtures that incorporate math, code, and synthetic translated data, evaluating on multilingual benchmarks. The central claims are that data composition is the primary driver of CPT gains (with math+code+synthetic additions yielding consistent improvements, including on reasoning tasks), that architectural choices dominate scale when comparing across families, that strong base-model multilingual performance does not reliably predict post-CPT outcomes, and that the best models also improve long-context performance such as document-level translation. Models are released on Hugging Face.
Significance. If the empirical findings hold after addressing the synthetic-data concerns, the work offers practical, actionable guidance for adapting LLMs to low-resource African languages by prioritizing targeted data mixes over scale or base-model selection alone. The open release of models and the breadth of the architecture sweep constitute clear strengths that increase the potential impact for the multilingual NLP community.
major comments (2)
- [Data preparation / CPT mixtures] Data preparation section (methods describing the CPT mixtures): the manuscript provides no details on the generation method, quality filtering, or decontamination steps used for the synthetic translated data. Because the headline result attributes consistent gains on reasoning-oriented evaluations to the inclusion of this data, the absence of these safeguards leaves open the possibility that observed improvements reflect benchmark leakage or distributional overlap rather than genuine capability lift.
- [Results and analysis] Results section (analysis of data composition as primary driver): the claim that data composition dominates requires explicit ablation tables with statistical tests (e.g., paired significance tests or confidence intervals) comparing the incremental additions of math, code, and synthetic data against the base CPT corpus. Without these, it is difficult to quantify the effect sizes or rule out that architectural variance explains more of the variance than the data mixes.
minor comments (2)
- [Abstract and methods] The abstract states '26B tokens' but the full experimental description should report exact token counts and sampling ratios for each mixture variant to allow exact reproduction.
- [Figures and tables] Figure captions and table headers should explicitly list the base models and the precise benchmark suites used for each reported score to improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of our methods and results.
Point-by-point responses
Referee: Data preparation section (methods describing the CPT mixtures): the manuscript provides no details on the generation method, quality filtering, or decontamination steps used for the synthetic translated data. Because the headline result attributes consistent gains on reasoning-oriented evaluations to the inclusion of this data, the absence of these safeguards leaves open the possibility that observed improvements reflect benchmark leakage or distributional overlap rather than genuine capability lift.
Authors: We agree that these details were insufficiently documented. In the revised manuscript we have added a new subsection (Section 3.2.2) that fully specifies the synthetic data pipeline: translations were generated with NLLB-200 fine-tuned on high-quality African-language parallel corpora; quality filtering combined fastText language ID, perplexity thresholding against a held-out African corpus, and manual review of 500 random samples per language; decontamination removed all 5-gram overlaps (exact and fuzzy) with every evaluation benchmark using a custom script that we now release with the models. These steps were applied uniformly before mixing.
Revision: yes
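The exact 5-gram decontamination step described above can be sketched in a few lines. This is an illustrative reconstruction, not the authors' released script, and it omits the fuzzy-matching pass they also mention:

```python
def ngrams(tokens, n=5):
    """All contiguous n-grams of a token list, as tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_docs, benchmark_docs, n=5):
    """Drop any training document that shares an exact n-gram
    with any benchmark document (fuzzy matching omitted here)."""
    banned = set()
    for doc in benchmark_docs:
        banned |= ngrams(doc.lower().split(), n)
    return [
        doc for doc in train_docs
        if ngrams(doc.lower().split(), n).isdisjoint(banned)
    ]

# Toy example: the first document shares the 5-gram
# "the cat sat on the" with the benchmark and is removed.
train = [
    "the cat sat on the mat today",
    "completely unrelated training sentence about farming maize",
]
bench = ["question: the cat sat on the mat today?"]
print(decontaminate(train, bench))
```

The word-level tokenization and lowercasing here are simplifying assumptions; production pipelines typically decontaminate over normalized subword or character n-grams.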
Referee: Results section (analysis of data composition as primary driver): the claim that data composition dominates requires explicit ablation tables with statistical tests (e.g., paired significance tests or confidence intervals) comparing the incremental additions of math, code, and synthetic data against the base CPT corpus. Without these, it is difficult to quantify the effect sizes or rule out that architectural variance explains more of the variance than the data mixes.
Authors: We accept that the original analysis would benefit from more granular, statistically supported ablations. The revised manuscript now includes a dedicated ablation table (Table 4) that reports mean performance deltas and 95% bootstrap confidence intervals for each incremental addition (base CPT corpus, +math, +code, +synthetic) across all five base models. We further added paired Wilcoxon signed-rank tests (p < 0.05 after Bonferroni correction) confirming that the gains from the data additions are statistically significant and exceed the variance attributable to architecture within each model family. These results support our claim that data composition is the primary driver.
Revision: yes
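The percentile-bootstrap confidence interval the rebuttal promises can be computed from per-benchmark paired deltas alone. A minimal stdlib sketch, with made-up delta values standing in for the paper's Table 4 numbers:

```python
import random
import statistics

# Hypothetical per-benchmark accuracy deltas (mixture-with-math-code-synth
# minus base CPT) for one model; real values would come from Table 4.
deltas = [1.8, 2.4, 0.9, 3.1, 1.2, 2.0, 0.4, 2.7]

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    """95% percentile bootstrap CI for the mean of paired deltas."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2))]

lo, hi = bootstrap_ci(deltas)
print(f"mean delta {statistics.fmean(deltas):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
# A CI entirely above zero is evidence the data additions help.
```

The paired Wilcoxon signed-rank test mentioned in the rebuttal would be applied to the same delta vectors (e.g. via `scipy.stats.wilcoxon`), with Bonferroni correction across the three incremental comparisons.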
Circularity Check
No circularity: purely empirical CPT study with direct benchmark evaluation
full rationale
The paper reports continued pre-training runs across data mixtures (math, code, synthetic translations) on five base models, followed by direct measurement of downstream performance on multilingual benchmarks. No equations, derivations, or first-principles claims exist that could reduce to fitted inputs or self-citations. All results are obtained by training and evaluating on held-out test sets; data composition effects are measured rather than defined into the outcome. This matches the default expectation of a non-circular empirical paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard machine-learning assumptions on data distribution, transfer, and benchmark validity hold for continued pre-training on low-resource languages.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Paper passage: "Our results identify data composition as the primary driver of CPT gains. Adding math, code, and synthetic translated data yields consistent improvements..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging
DiM3 merges multilingual and multimodal model updates in a direction- and magnitude-aware way to enhance multilingual performance in vision-language models while preserving original multimodal abilities.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.