
arxiv: 2606.02100 · v1 · pith:EH2WS2IUnew · submitted 2026-06-01 · 💻 cs.CL

PortBERT: Navigating the Depths of Portuguese Language Models

Pith reviewed 2026-06-28 14:41 UTC · model grok-4.3

classification 💻 cs.CL
keywords Portuguese language models · RoBERTa · ExtraGLUE · model efficiency · pre-training · transformer models · monolingual NLP

The pith

PortBERT base and large models match or exceed prior Portuguese NLP performance on translated GLUE tasks while documenting efficiency metrics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PortBERT as a family of RoBERTa-based language models built specifically for Portuguese to balance accuracy with practical training and inference costs. It trains two sizes from scratch on more than 450 GB of filtered and deduplicated Portuguese text drawn from mC4 and OSCAR23 using byte-level BPE and the fairseq library on both GPU and TPU hardware. Evaluation uses ExtraGLUE, a collection of translated English GLUE and SuperGLUE tasks. The base and large variants reach or surpass the scores of existing monolingual and multilingual models on these tasks. The work additionally supplies concrete measurements of training duration, inference speed, and fine-tuning throughput to highlight compute-performance tradeoffs that earlier Portuguese models had left largely unexamined.
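As a rough illustration of why byte-level BPE suits accented Portuguese text, the sketch below starts from raw UTF-8 bytes and applies a single merge step. It is a toy under stated assumptions, not the tokenizer the authors trained: the function names, the example string, and the choice of id 256 for the first merged symbol are all hypothetical, and production byte-level BPE (as used by RoBERTa-style models) adds pre-tokenization and a byte-to-character mapping that are omitted here.

```python
from collections import Counter


def to_byte_symbols(text):
    """Map text to its UTF-8 byte values (0-255), the base alphabet of byte-level BPE."""
    return list(text.encode("utf-8"))


def most_frequent_pair(seq):
    """Return the most frequent adjacent symbol pair, or None if there is no pair."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair, new_symbol):
    """Replace each non-overlapping occurrence of `pair` (left to right) with `new_symbol`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


# Hypothetical example string; accented characters take two bytes each in UTF-8.
symbols = to_byte_symbols("ação, nação, canção")
pair = most_frequent_pair(symbols)
merged = merge_pair(symbols, pair, 256)  # 256 = first id outside the byte range
print(len(symbols), "->", len(merged))
```

Because every string decomposes into bytes, no input is ever out of vocabulary; repeated merges then build longer units such as common Portuguese suffixes.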

Core claim

PortBERT consists of two RoBERTa-style transformer models trained from scratch on a large Portuguese corpus; when evaluated on the translated ExtraGLUE benchmark the base and large variants match or surpass the accuracy of prior monolingual and multilingual models while the authors also record training times, inference latency, and fine-tuning throughput to quantify efficiency.

What carries the argument

PortBERT, a pair of RoBERTa-based transformer language models trained from scratch on deduplicated Portuguese text with byte-level BPE tokenization and stable pre-training routines.

If this is right

  • PortBERT base and large reach competitive or higher accuracy than prior models on the ExtraGLUE suite of Portuguese tasks.
  • Training, inference, and fine-tuning throughput numbers are reported, allowing direct efficiency comparisons with other models.
  • Public release of Hugging Face weights and fairseq checkpoints makes the models immediately usable for downstream Portuguese applications.
  • The emphasis on compute-performance tradeoffs supplies a practical complement to earlier Portuguese models that focused mainly on scale or peak accuracy.
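The throughput figures mentioned in the list above can be thought of as samples processed per second of wall-clock time. The following is a minimal sketch of such a measurement, not the authors' benchmarking code: `measure_throughput`, `step_fn`, and the dummy workload are hypothetical names introduced for illustration, and the warm-up count is an assumption.

```python
import time


def measure_throughput(step_fn, batches, warmup=1):
    """Return (samples_per_second, elapsed_seconds) for step_fn over the timed batches.

    The first `warmup` batches are run but not timed, so one-off setup costs
    do not distort the measurement.
    """
    for batch in batches[:warmup]:
        step_fn(batch)
    timed = batches[warmup:]
    n_samples = sum(len(b) for b in timed)
    start = time.perf_counter()
    for batch in timed:
        step_fn(batch)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:  # guard against a clock too coarse to register the run
        return float("inf"), elapsed
    return n_samples / elapsed, elapsed


def dummy_step(batch):
    """Stand-in for a model forward or fine-tuning step (hypothetical workload)."""
    return sum(x * x for x in batch)


rate, seconds = measure_throughput(dummy_step, [list(range(1000))] * 5)
print(f"{rate:.0f} samples/s over {seconds:.4f} s")
```

On accelerators that execute asynchronously, the device would also need to be synchronized before each clock reading; that step is left out of this CPU-only sketch.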

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reported efficiency numbers could help practitioners choose a model size that fits available hardware without sacrificing benchmark scores.
  • The same data-filtering and hardware-agnostic training approach might be reused for other languages where large clean corpora exist but dedicated models are scarce.
  • If native Portuguese benchmarks later show different relative rankings, the current ExtraGLUE results should be reinterpreted rather than assumed to transfer to native tasks.

Load-bearing premise

Translated English GLUE and SuperGLUE tasks provide a faithful measure of Portuguese language understanding without meaningful distortion from translation or cultural mismatch.

What would settle it

New results on native, untranslated Portuguese understanding tasks that place PortBERT below the strongest existing models, or direct evidence that translation artifacts systematically inflate or deflate ExtraGLUE scores.
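One way to make "different relative rankings" concrete is a rank correlation between model scores on ExtraGLUE and on a native benchmark. The sketch below computes Kendall's tau-a in plain Python. It is an illustration only: the function name and the model labels are hypothetical, and the scores are made-up placeholders, not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score dicts keyed by model name.

    +1 means both benchmarks order the shared models identically,
    -1 means the order is fully reversed. Tied pairs count as neither.
    """
    models = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        s = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs if n_pairs else 0.0


# Placeholder scores, invented purely to exercise the function.
translated = {"model_a": 0.80, "model_b": 0.75, "model_c": 0.70}
native = {"model_a": 0.60, "model_b": 0.65, "model_c": 0.70}
print(kendall_tau(translated, native))
```

A tau near +1 would suggest the translated benchmark preserves model ordering; a low or negative value would point to the translation or cultural mismatch the premise assumes away.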

Figures

Figures reproduced from arXiv: 2606.02100 by Armando B. Mendes, Henry He, Raphael Scheible-Schmitt.

Figure 1: Performance–throughput trade-off across models. The top plot shows the relationship between average …
Figure 2: Perplexity of the PortBERT models. Top: based on validation at the checkpoints. Bottom: based on the validation of each optimization cycle during training.
Appendix C, Efficiency Measurements: Tables 6 and 7 report detailed runtime statistics for all models and tasks.
original abstract

Transformer models dominate modern NLP, but efficient, language-specific models remain scarce. In Portuguese, most focus on scale or accuracy, often neglecting training and deployment efficiency. In the present work, we introduce PortBERT, a family of RoBERTa-based language models for Portuguese, designed to balance performance and efficiency. Trained from scratch on over 450 GB of deduplicated and filtered mC4 and OSCAR23 from CulturaX using fairseq, PortBERT leverages byte-level BPE tokenization and stable pre-training routines across both GPU and TPU processors. We release two variants, PortBERT base and PortBERT large, and evaluate them on ExtraGLUE, a suite of translated GLUE and SuperGLUE tasks. Both models perform competitively, matching or surpassing existing monolingual and multilingual models. Beyond accuracy, we report training and inference times as well as fine-tuning throughput, providing practical insights into model efficiency. PortBERT thus complements prior work by addressing the underexplored dimension of compute-performance tradeoffs in Portuguese NLP. We release all models on Huggingface and provide fairseq checkpoints to support further research and applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces PortBERT, a family of RoBERTa-based language models for Portuguese trained from scratch on over 450 GB of deduplicated mC4 and OSCAR23 data using fairseq with byte-level BPE. It releases base and large variants and evaluates them on ExtraGLUE (translated GLUE and SuperGLUE tasks), claiming competitive or superior performance relative to existing monolingual and multilingual models while also reporting training/inference times and fine-tuning throughput to highlight efficiency tradeoffs.

Significance. If the performance claims are substantiated, the work fills a gap in efficient Portuguese-specific models by emphasizing compute-performance balance and publicly releasing models on Hugging Face plus fairseq checkpoints, which supports reproducibility and further research in an underexplored language.

major comments (2)
  1. [Abstract] Abstract: the claim that 'both models perform competitively, matching or surpassing existing monolingual and multilingual models' on ExtraGLUE supplies no numerical scores, baseline details, statistical tests, or error bars, preventing verification of the central empirical claim.
  2. [Abstract / Evaluation] Evaluation (ExtraGLUE description): the paper states that tasks were translated but provides no evidence of translation-quality controls such as back-translation checks, human fidelity ratings, or side-by-side comparison against native Portuguese benchmarks; without this, translation artifacts remain a plausible confound that could invalidate ExtraGLUE as a faithful proxy for Portuguese understanding.
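The error bars requested in the first comment could, for example, come from a paired bootstrap over test examples. The sketch below is one such approach under stated assumptions, not a procedure the paper reports: `paired_bootstrap_ci` is a hypothetical helper, the inputs are per-example 0/1 correctness flags for two systems on the same test set, and the resample count, alpha level, and seed are arbitrary defaults.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile confidence interval for accuracy(A) - accuracy(B).

    Both lists hold 0/1 correctness flags for the same test examples, so each
    resample draws example indices once and scores both systems on them.
    """
    if not correct_a or len(correct_a) != len(correct_b):
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Synthetic flags for illustration only: A is right on every example, B on none.
low, high = paired_bootstrap_ci([1] * 50, [0] * 50)
print(low, high)
```

If the resulting interval excludes zero, the accuracy gap between the two systems is unlikely to be an artifact of which test examples happened to be sampled.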

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on the manuscript. The comments highlight opportunities to strengthen the abstract and evaluation section, and we address each point below with proposed revisions.

point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'both models perform competitively, matching or surpassing existing monolingual and multilingual models' on ExtraGLUE supplies no numerical scores, baseline details, statistical tests, or error bars, preventing verification of the central empirical claim.

    Authors: We agree that the abstract would benefit from concrete numerical support. In the revised manuscript we will update the abstract to reference key results from the evaluation section, including average ExtraGLUE scores for both PortBERT variants and direct comparisons to the main baselines (BERTimbau, mBERT, XLM-R). The full per-task scores, standard deviations where available, and baseline details remain in the tables and text; the abstract change will direct readers to these results for verification.

    revision: yes

  2. Referee: [Abstract / Evaluation] Evaluation (ExtraGLUE description): the paper states that tasks were translated but provides no evidence of translation-quality controls such as back-translation checks, human fidelity ratings, or side-by-side comparison against native Portuguese benchmarks; without this, translation artifacts remain a plausible confound that could invalidate ExtraGLUE as a faithful proxy for Portuguese understanding.

    Authors: This observation is correct: the manuscript describes ExtraGLUE as translated tasks but supplies no additional quality-control evidence. We will revise the evaluation section to describe the translation pipeline used, explicitly note the absence of back-translation or human fidelity checks as a limitation, and discuss how this setup aligns with prior Portuguese NLP work that relies on the same translated benchmarks. These additions will improve transparency without altering the reported experimental results.

    revision: yes

Circularity Check

0 steps flagged

No derivation chain present; empirical model training and benchmark evaluation

full rationale

The paper describes training RoBERTa-based models on Portuguese corpora and evaluating them on translated GLUE/SuperGLUE tasks (ExtraGLUE). No equations, derivations, fitted parameters, or predictions are claimed. All performance statements rest on direct external benchmark comparisons rather than any internal reduction or self-referential construction. No self-citation load-bearing steps or ansatz smuggling occur. The contribution is a standard empirical release and is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated beyond the implicit assumption that standard RoBERTa pre-training on filtered web text yields useful Portuguese representations.

pith-pipeline@v0.9.1-grok · 5728 in / 1040 out tokens · 29222 ms · 2026-06-28T14:41:07.315468+00:00 · methodology

