
arxiv: 2404.02534 · v2 · submitted 2024-04-03 · 💻 cs.CL · cs.AI

ANGOFA: Leveraging OFA Embedding Initialization and Synthetic Data for Angolan Language Model

Pith reviewed 2026-05-24 02:40 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Angolan languages · multilingual language models · MAFT · OFA embedding initialization · synthetic data · low-resource languages · pre-trained language models · adaptive fine-tuning

The pith

Four new pre-trained models for Angolan languages outperform the prior best by 12.3 points through multilingual adaptive fine-tuning, OFA embedding initialization, and synthetic data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates four language models specifically for Angolan languages by applying Multilingual Adaptive Fine-tuning to existing multilingual models. It tests the added value of OFA-style informed embedding initialization and the generation of synthetic data for these very-low-resource settings. The resulting models exceed the previous state-of-the-art AfroXLMR-base by 12.3 points and improve on an OFA baseline by 3.8 points on downstream tasks. A sympathetic reader would see this as evidence that targeted initialization and data augmentation can close performance gaps for languages previously left out of multilingual pre-training.

Core claim

The authors introduce four tailored PLMs for Angolan languages using MAFT and show that combining it with OFA embedding initialization and synthetic data produces measurable gains: 12.3 points above AfroXLMR-base (itself developed through MAFT) and 3.8 points above OFA alone.

What carries the argument

Multilingual Adaptive Fine-tuning (MAFT) augmented by OFA embedding initialization and synthetic data, used to adapt models to Angolan languages.
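
To make "informed embedding initialization" concrete, the following is a minimal sketch of one generic way to do it: a new subword's embedding is set to a similarity-weighted average of already-trained source embeddings, with similarities measured in an external vector space. It is an illustration only and does not attempt to reproduce OFA's actual procedure; the function name `init_new_embeddings`, the `aux_vectors` input, and the `top_k` and `temperature` parameters are all assumptions introduced here.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def init_new_embeddings(
    source_emb: Dict[str, Vector],   # trained embeddings of the source model
    aux_vectors: Dict[str, Vector],  # hypothetical external vectors for old and new tokens
    new_tokens: List[str],
    top_k: int = 2,
    temperature: float = 0.1,
) -> Dict[str, Vector]:
    dim = len(next(iter(source_emb.values())))
    # Fallback for tokens with no external vector: mean of all source embeddings.
    mean = [sum(vec[i] for vec in source_emb.values()) / len(source_emb)
            for i in range(dim)]
    anchors = [tok for tok in source_emb if tok in aux_vectors]

    result: Dict[str, Vector] = {}
    for tok in new_tokens:
        if tok in source_emb:
            # Token already exists in the source vocabulary: copy its embedding.
            result[tok] = list(source_emb[tok])
            continue
        if tok not in aux_vectors or not anchors:
            result[tok] = list(mean)
            continue
        # Rank source tokens by similarity to the new token in the external space.
        sims = sorted(
            ((cosine(aux_vectors[tok], aux_vectors[src]), src) for src in anchors),
            reverse=True,
        )[:top_k]
        # Softmax over the top-k similarities gives the mixing weights.
        best = sims[0][0]
        weights = [math.exp((sim - best) / temperature) for sim, _ in sims]
        total = sum(weights)
        result[tok] = [
            sum(w / total * source_emb[src][i] for w, (_, src) in zip(weights, sims))
            for i in range(dim)
        ]
    return result
```

The design choice worth noting is the fallback order: copy when the token already exists, interpolate when external evidence is available, and use the mean embedding otherwise, so that no new row starts from random noise.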

If this is right

  • Angolan languages now have dedicated models that can be used directly for downstream tasks.
  • Embedding initialization techniques like OFA become a practical lever when adapting multilingual models to additional languages.
  • Synthetic data can serve as a reliable supplement when natural text for a target language is scarce.
  • The same MAFT-plus-initialization recipe may extend to other very-low-resource languages with similar data profiles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same initialization-plus-synthetic-data pattern could be tested on other African language clusters that share typological features with Angolan languages.
  • If the gains hold under controlled hyperparameter sweeps, the method lowers the barrier for groups without access to massive natural corpora.
  • Future work could measure how much of the improvement traces to better token coverage versus better semantic alignment from the synthetic examples.
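
One possible proxy for the "token coverage" mentioned in the last point is subword fertility, the average number of subword pieces produced per whitespace-separated word. The sketch below assumes a caller-supplied `tokenize` function; `chunk_tokenizer` is a toy stand-in invented for illustration and is not the tokenizer of any model discussed here.

```python
from typing import Callable, Iterable, List


def fertility(sentences: Iterable[str],
              tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword pieces per whitespace-separated word."""
    words = 0
    pieces = 0
    for sentence in sentences:
        words += len(sentence.split())
        pieces += len(tokenize(sentence))
    if words == 0:
        raise ValueError("no words found in the input sentences")
    return pieces / words


def chunk_tokenizer(sentence: str, size: int = 3) -> List[str]:
    """Toy tokenizer (assumption): cuts every word into fixed-size chunks."""
    return [word[i:i + size]
            for word in sentence.split()
            for i in range(0, len(word), size)]
```

Comparing fertility on the same held-out text before and after vocabulary adaptation would indicate whether the tokenizer fragments the target language less, which is separate from any gain in semantic alignment.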

Load-bearing premise

The reported gains come from the described combination of MAFT, OFA initialization, and synthetic data rather than from differences in training data volume, hyperparameters, or evaluation choices not detailed in the work.

What would settle it

Reproduce the MAFT runs while removing either the OFA initialization step or the synthetic data component and check whether the 12.3-point and 3.8-point margins over the respective baselines disappear.
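
The check described above amounts to a small margin table: for each configuration (full recipe, recipe without OFA initialization, recipe without synthetic data), subtract each baseline's score. A minimal sketch follows; the configuration names and all scores are toy placeholders invented for illustration, not results from the paper.

```python
from typing import Dict, Iterable, Tuple


def margin_table(
    scores: Dict[str, float],
    systems: Iterable[str],
    baselines: Iterable[str],
) -> Dict[Tuple[str, str], float]:
    """Return score(system) - score(baseline) for every pair, rounded to 0.1."""
    baselines = list(baselines)
    return {
        (system, baseline): round(scores[system] - scores[baseline], 1)
        for system in systems
        for baseline in baselines
    }


# Toy placeholder scores (assumption): NOT taken from the paper.
toy_scores = {
    "baseline_a": 50.0,
    "baseline_b": 58.0,
    "full": 62.0,
    "full_minus_ofa_init": 57.0,
    "full_minus_synthetic": 60.0,
}

table = margin_table(
    toy_scores,
    systems=["full", "full_minus_ofa_init", "full_minus_synthetic"],
    baselines=["baseline_a", "baseline_b"],
)
for (system, baseline), delta in table.items():
    print(f"{system} vs {baseline}: {delta:+.1f}")
```

If a margin stays roughly unchanged when a component is removed, that component is unlikely to be what drives the headline number.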

Original abstract

In recent years, the development of pre-trained language models (PLMs) has gained momentum, showcasing their capacity to transcend linguistic barriers and facilitate knowledge transfer across diverse languages. However, this progress has predominantly bypassed the inclusion of very-low resource languages, creating a notable void in the multilingual landscape. This paper addresses this gap by introducing four tailored PLMs specifically finetuned for Angolan languages, employing a Multilingual Adaptive Fine-tuning (MAFT) approach. In this paper, we survey the role of informed embedding initialization and synthetic data in enhancing the performance of MAFT models in downstream tasks. We improve baseline over SOTA AfroXLMR-base (developed through MAFT) and OFA (an effective embedding initialization) by 12.3 and 3.8 points respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces four pre-trained language models for Angolan languages via Multilingual Adaptive Fine-tuning (MAFT). It examines the contribution of OFA embedding initialization and synthetic data to downstream task performance, reporting gains of 12.3 points over the SOTA AfroXLMR-base (MAFT) baseline and 3.8 points over an OFA-initialized model.

Significance. If the numerical gains can be shown to result from the proposed components under matched conditions, the work would help fill a gap for very-low-resource languages by supplying both models and evidence on effective initialization and data strategies for Angolan languages.

major comments (2)
  1. [Abstract, §4] Abstract and §4 (Experiments): The headline claim of 12.3-point and 3.8-point improvements requires that the new models were trained and evaluated under conditions identical to the AfroXLMR-base and OFA baselines (same data volume, optimizer schedule, batch size, number of steps, and exact downstream tasks/metrics). No such controls or reproduction details are provided, so the gains cannot yet be attributed to OFA initialization plus synthetic data rather than protocol differences.
  2. [§4.1, Table 1] §4.1 and Table 1: The description of the four new models and the MAFT procedure does not state the total token count, language sampling ratios, or whether the Angolan-language data overlap with the data used to train the cited AfroXLMR-base baseline.
minor comments (2)
  1. [Abstract] Abstract, sentence 5: 'We improve baseline over SOTA' is grammatically unclear; rephrase to 'We improve over the SOTA AfroXLMR-base baseline'.
  2. [§3] §3: The survey of embedding initialization and synthetic data would benefit from explicit citations to the exact prior works whose methods are being combined.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of reproducibility. We address each major comment below and will revise the manuscript to strengthen the presentation of our experimental protocol.

Point-by-point responses
  1. Referee: [Abstract, §4] Abstract and §4 (Experiments): The headline claim of 12.3-point and 3.8-point improvements requires that the new models were trained and evaluated under conditions identical to the AfroXLMR-base and OFA baselines (same data volume, optimizer schedule, batch size, number of steps, and exact downstream tasks/metrics). No such controls or reproduction details are provided, so the gains cannot yet be attributed to OFA initialization plus synthetic data rather than protocol differences.

    Authors: We agree that the current manuscript lacks explicit side-by-side documentation of training hyperparameters and evaluation protocols. In the revision we will add a new subsection under Experiments that tabulates the exact data volume (in tokens), optimizer schedule, batch size, number of steps, and downstream task definitions used for both our models and the cited AfroXLMR-base and OFA baselines. This will make clear that the reported gains were obtained under matched conditions and can be attributed to the OFA initialization and synthetic-data components. revision: yes

  2. Referee: [§4.1, Table 1] §4.1 and Table 1: The description of the four new models and the MAFT procedure does not state the total token count, language sampling ratios, or whether the Angolan-language data overlap with the data used to train the cited AfroXLMR-base baseline.

    Authors: We acknowledge these omissions. The revised §4.1 and Table 1 will report (i) the total number of tokens seen during MAFT for each of the four models, (ii) the language sampling ratios employed in the multilingual mixture, and (iii) the degree of overlap between our Angolan-language corpora and the data used to train AfroXLMR-base, together with a brief discussion of any implications for the observed gains. revision: yes
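
For readers unfamiliar with what "language sampling ratios" in item (ii) would look like, a common approach in multilingual training is to raise each language's share of tokens to a power below 1 and renormalize, which upsamples small languages. The paper's text as summarized here does not say which scheme was used, so the sketch below is an assumption; the function name `sampling_ratios`, the default `alpha = 0.3`, and the language labels are illustrative.

```python
from typing import Dict


def sampling_ratios(token_counts: Dict[str, int],
                    alpha: float = 0.3) -> Dict[str, float]:
    """Exponentially smoothed sampling probabilities per language.

    alpha = 1.0 reproduces the raw token proportions;
    alpha = 0.0 samples every language uniformly.
    """
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    total = sum(token_counts.values())
    smoothed = {lang: (count / total) ** alpha
                for lang, count in token_counts.items()}
    norm = sum(smoothed.values())
    return {lang: weight / norm for lang, weight in smoothed.items()}
```

Reporting the raw token counts alongside the resulting ratios would let a reader see both how much data each language contributed and how often it was actually sampled.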

Circularity Check

0 steps flagged

No circularity: empirical comparisons to external baselines

The paper reports empirical performance gains (12.3 and 3.8 points) on downstream tasks for new Angolan PLMs. These rest on direct comparisons to independently developed external models (AfroXLMR-base via MAFT, and OFA). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The central claims are falsifiable against external benchmarks and do not reduce to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that MAFT plus OFA initialization plus synthetic data will transfer effectively to Angolan languages; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption: Multilingual Adaptive Fine-tuning can adapt existing models to new low-resource languages when combined with informed embedding initialization and synthetic data.
    The paper's approach depends on this transfer-learning premise stated in the abstract.


