NorBERTo: A ModernBERT Model Trained for Portuguese with 331 Billion Tokens Corpus
Pith reviewed 2026-05-09 20:49 UTC · model grok-4.3
The pith
NorBERTo is a ModernBERT encoder for Portuguese trained on a new 331-billion-token corpus that posts the best entailment and paraphrase results among the encoders evaluated on PLUE and ASSIN 2.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NorBERTo is introduced as a modern encoder derived from the ModernBERT architecture with long-context support and efficient attention mechanisms, trained on the Aurora-PT corpus of 331 billion GPT-2 tokens collected from diverse Brazilian Portuguese web sources and prior multilingual data. On the PLUE benchmark suite the large variant achieves 0.9191 F1 on MRPC and 0.7689 accuracy on RTE, the best results among the encoders evaluated. On ASSIN 2 it records the highest entailment F1 of approximately 0.904 among encoders, while Aurora-PT is positioned as the largest open monolingual Portuguese resource.
What carries the argument
NorBERTo, the ModernBERT-derived encoder trained on the Aurora-PT corpus, which supplies efficient attention and long-context handling to improve Portuguese text processing.
Load-bearing premise
The web-sourced Aurora-PT corpus is clean, diverse, and free of noise or biases so that benchmark gains reflect genuine language-modeling advances rather than data artifacts.
What would settle it
Retraining the identical NorBERTo architecture on a randomly sampled smaller subset of Aurora-PT and measuring whether F1 on MRPC, accuracy on RTE, and entailment F1 on ASSIN 2 drop substantially would test whether the full corpus scale is required for the reported gains.
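A minimal sketch of the subsetting step in that experiment, assuming document-level access to the corpus and counting size in GPT-2 tokens as the paper does; the token budget, the toy documents, and the in-memory shuffle are placeholders for what would really be shard-level streaming over Aurora-PT:

```python
# Sketch: draw a random token-budget subset of a pretraining corpus so the same
# architecture can be retrained on it and compared on MRPC, RTE, and ASSIN 2.
import random
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # corpus size is quoted in GPT-2 tokens

def sample_subset(documents, token_budget, seed=0):
    """Shuffle documents and keep them until the token budget is exhausted."""
    rng = random.Random(seed)
    docs = list(documents)
    rng.shuffle(docs)
    kept, total = [], 0
    for doc in docs:
        n_tokens = len(tokenizer.encode(doc))
        if total + n_tokens > token_budget:
            break
        kept.append(doc)
        total += n_tokens
    return kept, total

# Toy stand-in; a real run would stream Aurora-PT shards from disk.
corpus = ["O gato sentou no tapete.", "A capivara nadou no rio."] * 500
subset, n_tokens = sample_subset(corpus, token_budget=5_000)
print(f"kept {len(subset)} documents, {n_tokens} GPT-2 tokens")

# The decisive experiment: pretrain the identical NorBERTo configuration on
# `subset` and on the full corpus, fine-tune both the same way, and check
# whether MRPC F1, RTE accuracy, and ASSIN 2 entailment F1 drop substantially.
```

At 331 billion tokens the shuffling and token counting would of course run over shards in parallel; the sketch only fixes the shape of the comparison.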
Original abstract
High-quality corpora are essential for advancing Natural Language Processing (NLP) in Portuguese. Building on previous encoder-only models such as BERTimbau and Albertina PT-BR, we introduce NorBERTo, a modern encoder based on the ModernBERT architecture, featuring long-context support and efficient attention mechanisms. NorBERTo is trained on Aurora-PT, a newly curated Brazilian Portuguese corpus comprising 331 billion GPT-2 tokens collected from diverse web sources and existing multilingual datasets. We systematically benchmark NorBERTo against strong baselines on semantic similarity, textual entailment and classification tasks using standardized datasets such as ASSIN 2 and PLUE. On PLUE, NorBERTo-large achieves the best results among the encoder models we evaluated, notably reaching 0.9191 F1 on MRPC and 0.7689 accuracy on RTE. On ASSIN 2, NorBERTo-large attains the highest entailment F1 (~0.904) among all encoders considered, although Albertina-900M and BERTimbau-large still hold an advantage. To the best of our knowledge, Aurora-PT is currently the largest openly available monolingual Portuguese corpus, surpassing previous resources. NorBERTo provides a modern, mid-sized encoder designed for realistic deployment scenarios: it is straightforward to fine-tune, efficient to serve, and well suited as a backbone for retrieval-augmented generation and other downstream Portuguese NLP systems.
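The abstract's deployment claim, that NorBERTo is straightforward to fine-tune, can be pictured concretely. Below is a minimal sketch with the Hugging Face Trainer on the ASSIN 2 entailment task; the checkpoint id norberto-large is a placeholder (the abstract names no released model id), and the dataset field names are assumed from the public assin2 dataset card rather than taken from the paper.

```python
# Hedged sketch: fine-tune a Portuguese encoder on ASSIN 2 entailment with the
# Hugging Face Trainer. "norberto-large" is a hypothetical checkpoint id.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "norberto-large"  # placeholder; substitute the actual released model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# ASSIN 2 pairs a premise with a hypothesis and labels the pair for entailment
# (field names assumed from the Hugging Face "assin2" dataset card).
dataset = load_dataset("assin2")

def preprocess(batch):
    encoded = tokenizer(batch["premise"], batch["hypothesis"],
                        truncation=True, padding="max_length", max_length=256)
    encoded["labels"] = batch["entailment_judgment"]
    return encoded

tokenized = dataset.map(preprocess, batched=True)

args = TrainingArguments(output_dir="norberto-assin2", learning_rate=2e-5,
                         per_device_train_batch_size=16, num_train_epochs=3)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"],
                  eval_dataset=tokenized["validation"])
trainer.train()
print(trainer.evaluate())
```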
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces NorBERTo, a ModernBERT-based encoder model for Portuguese pretrained on the new Aurora-PT corpus of 331 billion tokens assembled from web sources and multilingual datasets. It claims NorBERTo-large achieves the best results among evaluated encoders on PLUE, with 0.9191 F1 on MRPC and 0.7689 accuracy on RTE, plus the highest entailment F1 (~0.904) on ASSIN 2, while positioning Aurora-PT as the largest openly available monolingual Portuguese corpus.
Significance. If the benchmark gains reflect genuine capability rather than artifacts, the work is significant for Portuguese NLP by delivering a substantially larger open pretraining resource and a modern, efficient encoder with long-context support that improves on prior models like BERTimbau and Albertina. The empirical focus on standardized tasks and positioning for downstream uses such as RAG adds practical value.
major comments (2)
- [Corpus description] Methods section: The paper provides no details on filtering, deduplication, or decontamination of Aurora-PT, nor any overlap statistics with the PLUE and ASSIN 2 test sets. Since the corpus is assembled from diverse web sources, the absence of these checks leaves open the possibility of benchmark contamination, which directly undermines the central claim that the reported deltas (e.g., MRPC F1 0.9191, RTE acc 0.7689, ASSIN 2 entailment F1 ~0.904) demonstrate superior model quality.
- [Experimental setup] Results and methods: No training hyperparameters, fine-tuning protocols, statistical significance tests, or error analysis are reported. This prevents verification of the claim that NorBERTo-large sets new encoder SOTA on the cited tasks and makes the comparisons to baselines (Albertina-900M, BERTimbau-large) non-reproducible (a sketch of what a seed-level significance comparison could look like follows below).
minor comments (2)
- [Abstract] The statement that 'Albertina-900M and BERTimbau-large still hold an advantage' is vague; it should specify on which tasks or metrics this holds to clarify the overall comparison.
- [Results] The manuscript would benefit from an explicit table or section reference listing all evaluated tasks and full baseline scores for transparency.
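The significance-testing gap flagged in the second major comment is cheap to close once per-seed scores exist. A minimal sketch of such a comparison in Python, using placeholder numbers rather than any results from the paper:

```python
# Sketch: compare two fine-tuned models across random seeds with a paired t-test.
# The scores below are placeholders, not results reported in the paper.
import numpy as np
from scipy import stats

# One fine-tuning run per seed for each model on the same task and test set.
norberto_f1 = np.array([0.918, 0.920, 0.917, 0.921, 0.919])  # hypothetical
baseline_f1 = np.array([0.902, 0.905, 0.899, 0.904, 0.903])  # hypothetical

print(f"NorBERTo: {norberto_f1.mean():.4f} +/- {norberto_f1.std(ddof=1):.4f}")
print(f"Baseline: {baseline_f1.mean():.4f} +/- {baseline_f1.std(ddof=1):.4f}")

# Paired test because both models share the seed and data split per run.
t_stat, p_value = stats.ttest_rel(norberto_f1, baseline_f1)
print(f"paired t-test: t = {t_stat:.3f}, p = {p_value:.4f}")
```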
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important omissions in the submitted manuscript. We address each point below and will revise the paper to improve transparency and reproducibility.
Point-by-point responses
- Referee: [Corpus description] Methods section: The paper provides no details on filtering, deduplication, or decontamination of Aurora-PT, nor any overlap statistics with the PLUE and ASSIN 2 test sets. Since the corpus is assembled from diverse web sources, the absence of these checks leaves open the possibility of benchmark contamination, which directly undermines the central claim that the reported deltas (e.g., MRPC F1 0.9191, RTE acc 0.7689, ASSIN 2 entailment F1 ~0.904) demonstrate superior model quality.
Authors: We acknowledge that the methods section omitted a full account of the Aurora-PT preprocessing pipeline. The corpus construction involved language identification, quality filtering, and basic deduplication, but these steps were not described in detail in the initial submission. In the revised manuscript we will add a dedicated subsection specifying the exact filtering criteria, MinHash-based deduplication, and n-gram decontamination procedure used to reduce overlap with evaluation sets. We will also report the measured overlap statistics (token-level and document-level) with the PLUE and ASSIN 2 test sets. These additions will directly address the contamination concern and support the validity of the reported performance differences. (A sketch of what such an n-gram overlap check could look like appears below, after these responses.) revision: yes
- Referee: [Experimental setup] Results and methods: No training hyperparameters, fine-tuning protocols, statistical significance tests, or error analysis are reported. This prevents verification of the claim that NorBERTo-large sets new encoder SOTA on the cited tasks and makes the comparisons to baselines (Albertina-900M, BERTimbau-large) non-reproducible.
Authors: The referee is correct that the current manuscript lacks the experimental details required for reproducibility. We will expand the methods and results sections to include the complete pretraining hyperparameters (learning rate, batch size, optimizer, number of steps, sequence length, and hardware), the precise fine-tuning protocols for each task (including learning rates, batch sizes, epochs, and early-stopping criteria), results of statistical significance tests across multiple random seeds, and a concise error analysis for the primary tasks. These revisions will allow readers to reproduce the comparisons and verify the SOTA claims. revision: yes
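On the decontamination procedure promised in the first response, a word-level n-gram overlap check is the usual mechanism. A minimal sketch under assumed choices: the 13-gram window, the normalization, and the helper names are illustrative, not the paper's actual pipeline.

```python
# Sketch: flag pretraining documents that share long word n-grams with any
# benchmark test example. Window size and normalization are illustrative.
import re

NGRAM = 13  # a common contamination window; the paper's actual setting is not stated

def normalize(text):
    return re.sub(r"\s+", " ", text.lower()).strip()

def word_ngrams(text, n=NGRAM):
    words = normalize(text).split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def contaminated_docs(corpus_docs, test_examples):
    """Return indices of corpus documents sharing an n-gram with any test example."""
    test_grams = set()
    for example in test_examples:
        test_grams |= word_ngrams(example)
    flagged = []
    for i, doc in enumerate(corpus_docs):
        if word_ngrams(doc) & test_grams:
            flagged.append(i)
    return flagged

# Toy usage; a real run would stream Aurora-PT shards and the PLUE / ASSIN 2 test sets.
docs = ["um documento da web qualquer sem relação com os benchmarks " * 3]
tests = ["uma premissa de teste e sua hipótese correspondente do assin dois " * 2]
print(contaminated_docs(docs, tests))  # -> [] for this toy example
```

Document-level overlap statistics fall out directly from the length of the flagged list; token-level figures would additionally count how many tokens the flagged spans cover.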
Circularity Check
No circularity: purely empirical corpus curation, model training, and benchmarking
Full rationale
The paper presents the construction of Aurora-PT (331B tokens from web sources and multilingual datasets) and the training of NorBERTo (ModernBERT-based encoder) on it, followed by direct benchmarking against baselines on external datasets (PLUE, ASSIN 2, MRPC, RTE). No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. All claims rest on observable training outcomes and standardized external evaluations rather than any self-referential reduction or ansatz smuggled via prior work. This is self-contained empirical NLP work with no derivation chain to inspect for circularity.
Axiom & Free-Parameter Ledger
free parameters (1)
- Training hyperparameters and model configuration
axioms (1)
- Domain assumption: the ModernBERT architecture transfers effectively to Portuguese when trained on sufficient data.
Reference graph
Works this paper leans on
- [1] Tupy-e: Detecting hate speech in Brazilian Portuguese social media with a novel dataset and comprehensive analysis of models. arXiv preprint arXiv:2312.17704.
- [2] The ASSIN 2 shared task: a quick overview. In International Conference on Computational Processing of the Portuguese Language, pages 406–412. Springer.
- [3] BERTimbau: Pretrained BERT models for Brazilian Portuguese. In Brazilian Conference on Intelligent Systems, pages 403–417.
- [4] HateBR: A large expert annotated corpus of Brazilian Instagram comments for offensive language and hate speech detection. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 7174–7183.