ANGOFA: Leveraging OFA Embedding Initialization and Synthetic Data for Angolan Language Model
Pith reviewed 2026-05-24 02:40 UTC · model grok-4.3
The pith
Four new pre-trained models for Angolan languages outperform the prior best by 12.3 points through multilingual adaptive fine-tuning, OFA embedding initialization, and synthetic data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce four tailored PLMs for Angolan languages using MAFT and show that combining MAFT with OFA embedding initialization and synthetic data produces measurable gains: 12.3 points over AfroXLMR-base (itself developed through MAFT) and 3.8 points over the OFA baseline.
What carries the argument
Multilingual Adaptive Fine-tuning (MAFT) augmented by OFA embedding initialization and synthetic data, used to adapt models to Angolan languages.
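To make the initialization idea concrete, here is a minimal sketch, assuming a simplified scheme in which each new subword embedding is a similarity-weighted average of embeddings the model already has. It is not the OFA procedure itself, which the source does not describe in detail; the function names, the auxiliary vector space, and the top_k cutoff are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def init_new_embedding(aux_new, aux_source, source_emb, top_k=2):
    """Initialize one new token as a weighted average of source embeddings.

    aux_new:    auxiliary vector for the new token (hypothetical external space)
    aux_source: {token: auxiliary vector} for tokens the model already knows
    source_emb: {token: model embedding} for those same tokens
    """
    ranked = sorted(
        ((cosine(aux_new, vec), tok) for tok, vec in aux_source.items()),
        reverse=True,
    )[:top_k]
    # Keep only positively similar neighbours so the weights stay non-negative.
    ranked = [(s, t) for s, t in ranked if s > 0]
    if not ranked:
        return None  # caller falls back to random initialization
    total = sum(s for s, _ in ranked)
    dim = len(next(iter(source_emb.values())))
    out = [0.0] * dim
    for s, tok in ranked:
        for i, x in enumerate(source_emb[tok]):
            out[i] += (s / total) * x
    return out
```

A token with no positively similar neighbour returns None, so the caller can fall back to random initialization instead of forcing an uninformed average.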
If this is right
- Angolan languages now have dedicated models that can be used directly for downstream tasks.
- Embedding initialization techniques like OFA become a practical lever when adapting multilingual models to additional languages.
- Synthetic data can serve as a reliable supplement when natural text for a target language is scarce.
- The same MAFT-plus-initialization recipe may extend to other very-low-resource languages with similar data profiles.
Where Pith is reading between the lines
- The same initialization-plus-synthetic-data pattern could be tested on other African language clusters that share typological features with Angolan languages.
- If the gains hold under controlled hyperparameter sweeps, the method lowers the barrier for groups without access to massive natural corpora.
- Future work could measure how much of the improvement traces to better token coverage versus better semantic alignment from the synthetic examples.
Load-bearing premise
The reported gains come from the described combination of MAFT, OFA initialization, and synthetic data rather than from differences in training data volume, hyperparameters, or evaluation choices not detailed in the work.
What would settle it
Reproduce the MAFT runs while removing either the OFA initialization step or the synthetic data component and check whether the 12.3-point and 3.8-point margins over the respective baselines disappear.
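The check above amounts to a 2x2 ablation over the two components. The sketch below enumerates that grid and shows the arithmetic behind the two reported margins. It assumes both margins are measured on the same tasks and metric, which the source does not confirm; the helper names and configuration keys are hypothetical, and no scores beyond the two reported margins are used.

```python
from itertools import product

# Margins reported in the abstract (points on the paper's downstream metric).
MARGIN_OVER_AFROXLMR = 12.3
MARGIN_OVER_OFA = 3.8

def implied_gap(margin_a, margin_b):
    """Gap between two baselines implied by one model's margins over each.

    Valid only if both margins are measured on the same tasks and metric.
    """
    return round(margin_a - margin_b, 1)

def ablation_grid():
    """All four on/off combinations of the two components under test."""
    return [
        {"ofa_init": ofa, "synthetic_data": syn}
        for ofa, syn in product([False, True], repeat=2)
    ]

def margins(scores, baseline_key):
    """Score of each configuration minus the score of the chosen baseline."""
    base = scores[baseline_key]
    return {k: round(v - base, 1) for k, v in scores.items()}

print(implied_gap(MARGIN_OVER_AFROXLMR, MARGIN_OVER_OFA))  # 8.5
```

Under that assumption, the OFA baseline already sits 8.5 points above AfroXLMR-base, so the ablation needs to isolate how much of the remaining 3.8 points comes from synthetic data rather than from initialization.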
Original abstract
In recent years, the development of pre-trained language models (PLMs) has gained momentum, showcasing their capacity to transcend linguistic barriers and facilitate knowledge transfer across diverse languages. However, this progress has predominantly bypassed the inclusion of very-low resource languages, creating a notable void in the multilingual landscape. This paper addresses this gap by introducing four tailored PLMs specifically finetuned for Angolan languages, employing a Multilingual Adaptive Fine-tuning (MAFT) approach. In this paper, we survey the role of informed embedding initialization and synthetic data in enhancing the performance of MAFT models in downstream tasks. We improve baseline over SOTA AfroXLMR-base (developed through MAFT) and OFA (an effective embedding initialization) by 12.3 and 3.8 points respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces four pre-trained language models for Angolan languages via Multilingual Adaptive Fine-tuning (MAFT). It examines the contribution of OFA embedding initialization and synthetic data to downstream task performance, reporting gains of 12.3 points over the SOTA AfroXLMR-base (MAFT) baseline and 3.8 points over an OFA-initialized model.
Significance. If the numerical gains can be shown to result from the proposed components under matched conditions, the work would help fill a gap for very-low-resource languages by supplying both models and evidence on effective initialization and data strategies for Angolan languages.
Major comments (2)
- [Abstract, §4] Abstract and §4 (Experiments): The headline claim of 12.3-point and 3.8-point improvements requires that the new models were trained and evaluated under conditions identical to the AfroXLMR-base and OFA baselines (same data volume, optimizer schedule, batch size, number of steps, and exact downstream tasks/metrics). No such controls or reproduction details are provided, so the gains cannot yet be attributed to OFA initialization plus synthetic data rather than protocol differences.
- [§4.1, Table 1] §4.1 and Table 1: The description of the four new models and the MAFT procedure does not state the total token count, language sampling ratios, or whether the Angolan-language data overlap with the data used to train the cited AfroXLMR-base baseline.
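The matched-conditions requirement in the first comment can be checked mechanically once both protocols are written down. A minimal sketch, assuming each run is described by a flat dictionary of settings; the key names are hypothetical stand-ins for the items the report lists, and no values from the paper are implied.

```python
# Settings the report asks to be held fixed; the key names are hypothetical.
CONTROLLED_KEYS = [
    "train_tokens",
    "optimizer_schedule",
    "batch_size",
    "num_steps",
    "tasks",
    "metric",
]

def protocol_mismatches(ours, baseline, keys=CONTROLLED_KEYS):
    """Return {key: (ours, baseline)} for every setting that differs or is missing.

    A missing setting is reported as None on the side where it is absent.
    """
    missing = object()
    diffs = {}
    for key in keys:
        a = ours.get(key, missing)
        b = baseline.get(key, missing)
        if a is missing or b is missing or a != b:
            diffs[key] = (
                None if a is missing else a,
                None if b is missing else b,
            )
    return diffs
```

An empty result is the condition under which the reported margins could be attributed to the method; any non-empty result names the confound to address.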
Minor comments (2)
- [Abstract] Abstract, final sentence: 'We improve baseline over SOTA' is grammatically unclear; rephrase to 'We improve over the SOTA AfroXLMR-base baseline'.
- [§3] §3: The survey of embedding initialization and synthetic data would benefit from explicit citations to the exact prior works whose methods are being combined.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of reproducibility. We address each major comment below and will revise the manuscript to strengthen the presentation of our experimental protocol.
- Referee: [Abstract, §4] Abstract and §4 (Experiments): The headline claim of 12.3-point and 3.8-point improvements requires that the new models were trained and evaluated under conditions identical to the AfroXLMR-base and OFA baselines (same data volume, optimizer schedule, batch size, number of steps, and exact downstream tasks/metrics). No such controls or reproduction details are provided, so the gains cannot yet be attributed to OFA initialization plus synthetic data rather than protocol differences.
  Authors: We agree that the current manuscript lacks explicit side-by-side documentation of training hyperparameters and evaluation protocols. In the revision we will add a new subsection under Experiments that tabulates the exact data volume (in tokens), optimizer schedule, batch size, number of steps, and downstream task definitions used for both our models and the cited AfroXLMR-base and OFA baselines. This will make clear that the reported gains were obtained under matched conditions and can be attributed to the OFA initialization and synthetic-data components.
  Revision: yes
- Referee: [§4.1, Table 1] §4.1 and Table 1: The description of the four new models and the MAFT procedure does not state the total token count, language sampling ratios, or whether the Angolan-language data overlap with the data used to train the cited AfroXLMR-base baseline.
  Authors: We acknowledge these omissions. The revised §4.1 and Table 1 will report (i) the total number of tokens seen during MAFT for each of the four models, (ii) the language sampling ratios employed in the multilingual mixture, and (iii) the degree of overlap between our Angolan-language corpora and the data used to train AfroXLMR-base, together with a brief discussion of any implications for the observed gains.
  Revision: yes
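Two of the quantities promised in this response, sampling ratios and corpus overlap, are simple to compute once the corpora are in hand. The sketch below assumes exponent-smoothed sampling (a common choice in multilingual training, not confirmed as the paper's scheme) and measures overlap as the share of distinct normalized lines in common. The function names and the default alpha are hypothetical.

```python
def sampling_ratios(token_counts, alpha=0.3):
    """Exponent-smoothed sampling: p_i is proportional to (n_i / N) ** alpha.

    alpha = 1 reproduces the raw token proportions; alpha < 1 upsamples
    languages with fewer tokens.
    """
    total = sum(token_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in token_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

def normalize(line):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(line.lower().split())

def line_overlap(corpus_a, corpus_b):
    """Fraction of distinct normalized lines in corpus_a also found in corpus_b."""
    a = {normalize(x) for x in corpus_a if x.strip()}
    b = {normalize(x) for x in corpus_b if x.strip()}
    return len(a & b) / len(a) if a else 0.0
```

Exact line matching is a lower bound on overlap: it misses near-duplicates and re-segmented text, so a low value here would not by itself rule out shared sources.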
Circularity Check
No circularity: empirical comparisons to external baselines
The paper reports empirical performance gains (12.3 and 3.8 points) on downstream tasks for new Angolan PLMs. These rest on direct comparisons to independently developed external models (AfroXLMR-base via MAFT, and OFA). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims are falsifiable against external benchmarks and do not reduce to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Multilingual Adaptive Fine-tuning can adapt existing models to new low-resource languages when combined with informed embedding initialization and synthetic data.