moBERTo: A Modern Encoder for Portuguese via Continued Pretraining of ModernBERT

Giovana Kerche Bon\'as; Jo\~ao Guilherme Alves Santos; Thales Sales Almeida; Thiago Laitz

arxiv: 2606.22722 · v1 · pith:PVFJYNQYnew · submitted 2026-06-21 · 💻 cs.CL

moBERTo: A Modern Encoder for Portuguese via Continued Pretraining of ModernBERT

Thiago Laitz , Thales Sales Almeida , Jo\~ao Guilherme Alves Santos , Giovana Kerche Bon\'as This is my paper

Pith reviewed 2026-06-26 10:05 UTC · model grok-4.3

classification 💻 cs.CL

keywords Portuguese NLPencoder modelscontinued pretraininginformation retrievalnamed entity recognitionlanguage model adaptationlong-context modeling

0 comments

The pith

Continued pretraining of ModernBERT on a large Portuguese corpus creates an encoder that leads on retrieval and NLU benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that adapting an existing English encoder model through further training on Portuguese text produces competitive results for information retrieval, document classification, named entity recognition, and natural language understanding. It preserves the base architecture while testing variants that include a language-specific tokenizer and extended context training. A sympathetic reader would care because encoder models continue to serve production pipelines, and this method shows a lower-cost route to language-specific versions than training entirely new models. The ablations indicate continued pretraining maintains long-context strengths more reliably than starting from random weights.

Core claim

moBERTo is obtained by continued pretraining of ModernBERT-base on 60 billion tokens drawn from a 12-billion-token Portuguese corpus filtered from FineWeb2 using educational and STEM classifiers. The strongest variant applies a Portuguese tokenizer with subword-matching embedding transfer plus a long-context post-training stage at 8192 tokens, delivering the highest average reranking nDCG@10 across three Portuguese retrieval benchmarks and the best scores on PLUE-PT. Ablations establish that continued pretraining outperforms from-scratch training for long-context retention, that tokenizer adaptation aids token-level tasks while sometimes harming retrieval, and that the extra long-context pha

What carries the argument

Continued pretraining of the ModernBERT checkpoint on a filtered Portuguese corpus, optionally combined with tokenizer adaptation and a dedicated long-context post-training phase.

If this is right

Continued pretraining preserves long-context capabilities more effectively than training from scratch.
Tokenizer adaptation improves results on token-level tasks such as NER while sometimes lowering long-context retrieval performance.
A separate long-context post-training phase at 8192 tokens further raises reranking and NER scores.
Encoder-only models remain competitive with larger decoder-only models on discriminative tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same continued-pretraining recipe could produce usable encoders for additional languages that possess comparable filtered web corpora.
Modern encoder architectures may transfer across languages more efficiently through targeted adaptation than through full retraining from random initialization.
Production systems needing both retrieval and classification in non-English settings could adopt this style of adaptation to reduce compute costs.

Load-bearing premise

The 12-billion-token Portuguese corpus is representative of real usage and the evaluation benchmarks measure practical performance without hidden biases or leakage.

What would settle it

A model trained from scratch on the identical corpus achieving equal or higher nDCG@10 on the three retrieval sets and higher PLUE-PT scores would undermine the claimed advantage of continued pretraining.

Figures

Figures reproduced from arXiv: 2606.22722 by Giovana Kerche Bon\'as, Jo\~ao Guilherme Alves Santos, Thales Sales Almeida, Thiago Laitz.

**Figure 1.** Figure 1: Overview of the moBERTo training and evaluation pipeline. We start from the original ModernBERT-base checkpoint and perform continued pretraining on 60B tokens of curated Portuguese data, followed by evaluation across multiple downstream tasks. filter the data using the educational and STEM classifiers from ClassiCC-PT, which have been shown to improve continued pretraining of LLMs [1,2]. The resulting cor… view at source ↗

read the original abstract

Encoder-only transformer models remain essential for production NLP pipelines. We introduce moBERTo, a Portuguese adaptation of ModernBERT obtained through continued pretraining of the ModernBERT-base checkpoint on 60 billion tokens (5 epochs over a 12-billion-token corpus curated from FineWeb2 and filtered with educational and STEM classifiers). We preserve the original architecture, including rotary positional embeddings, alternating local-global attention, flash attention, and unpadding. We evaluate moBERTo across information retrieval (including long-context retrieval at up to 8,192 tokens), document classification, named entity recognition, and natural language understanding. Our best variant, which combines a Portuguese tokenizer with subword-matching embedding transfer and long-context post-training, achieves the highest average reranking nDCG@10 across three Portuguese retrieval benchmarks and the best results on PLUE-PT. Through ablation studies, we show that (i) continued pretraining is strongly preferable to training from scratch, particularly for preserving long-context capabilities; (ii) tokenizer adaptation improves token-level tasks but degrades long-context retrieval; (iii) a dedicated long-context post-training phase at 8,192 tokens further improves reranking and NER; and (iv) encoder-only architectures remain competitive with larger decoder-only alternatives for discriminative tasks. We publicly release the model weights at https://huggingface.co/Tropic-AI/moBERTo and training data at https://huggingface.co/datasets/Tropic-AI/moberto-pretraining-dataset-c4-compatible on Hugging Face.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A practical Portuguese encoder from continued pretraining on ModernBERT, with clear ablations but light on corpus decontamination details.

read the letter

The main takeaway is that moBERTo gives the field a usable open Portuguese encoder model trained from the ModernBERT checkpoint, and the ablations make a reasonable case for why continued pretraining plus tokenizer swap plus long-context post-training beats the obvious baselines.

What stands out is the release of both weights and the 12B-token training corpus on Hugging Face. The paper shows continued pretraining preserves long-context behavior better than training from scratch, that a Portuguese tokenizer helps token-level tasks but hurts retrieval, and that an extra 8k-context phase lifts reranking and NER. Those comparisons are straightforward and the numbers line up with the claimed best average nDCG@10 on the three retrieval sets plus top PLUE-PT score.

The soft spots are mostly around the data side. The abstract mentions educational and STEM classifiers on FineWeb2 but gives no overlap statistics with the test sets, no threshold details, and no decontamination protocol. Without those, it is hard to rule out that some of the retrieval gains come from reduced leakage rather than genuine language adaptation. The evaluation section also lacks error bars or significance tests on the reported deltas, which makes the ablation claims harder to weigh.

This paper is for groups that need a strong Portuguese encoder right now or that want a worked example of adapting an existing modern encoder to another language. It is not advancing new methods, but the empirical recipe and the public artifacts are worth having. I would send it to peer review; the core claims are testable once the full methods and data splits are checked, and the resource itself has clear downstream value.

Referee Report

3 major / 2 minor

Summary. The paper introduces moBERTo, a Portuguese adaptation of ModernBERT-base obtained via continued pretraining on a 12-billion-token corpus curated from FineWeb2 and filtered using educational and STEM classifiers (5 epochs over 60 billion tokens total). The architecture is preserved, including rotary embeddings, alternating local-global attention, and flash attention. The best variant combines a Portuguese tokenizer, subword-matching embedding transfer, and long-context post-training at 8192 tokens. It reports the highest average reranking nDCG@10 on three Portuguese retrieval benchmarks plus best PLUE-PT results, with ablations showing continued pretraining outperforms from-scratch training, tokenizer effects on tasks, and benefits of long-context post-training. Model weights and the pretraining dataset are released publicly.

Significance. If the empirical claims hold after addressing data-quality details, the work provides a competitive open encoder for Portuguese NLP pipelines, demonstrates practical benefits of continued pretraining for preserving long-context capabilities, and supplies reusable resources (model and 12B-token dataset). The ablations on tokenizer adaptation and long-context phases offer transferable insights for language-specific adaptation of modern encoders. The release of both weights and data strengthens reproducibility.

major comments (3)

[Data curation / corpus filtering] Data curation section: the manuscript provides no quantitative overlap statistics between the filtered 12B-token corpus and the three Portuguese retrieval test sets, nor any decontamination protocol. This is load-bearing for the central claim of highest nDCG@10, as leakage could artifactually inflate the reported gains.
[Corpus construction] Corpus filtering description: no details are given on the training data, architecture, or decision thresholds of the educational and STEM classifiers used to filter FineWeb2. Without these, it is impossible to assess whether the corpus is representative or whether filtering introduces unintended biases that affect downstream retrieval and PLUE-PT results.
[Ablation studies] Evaluation and ablation sections: the abstract and results claim continued pretraining is "strongly preferable" to from-scratch training, yet no hyperparameter budgets, data schedules, or statistical significance tests (with error bars) are reported for the from-scratch baseline, undermining the ablation's ability to support the headline performance claims.

minor comments (2)

[Abstract] The abstract states "60 billion tokens (5 epochs over a 12-billion-token corpus)"; clarify whether the 60B figure refers to total tokens seen or a different quantity, and ensure this is consistent in the methods section.
[Results tables] Table or figure reporting nDCG@10 averages should include per-benchmark breakdowns and variance estimates to allow readers to judge whether the "highest average" is driven by one outlier benchmark.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the revisions that will be incorporated to improve transparency and reproducibility.

read point-by-point responses

Referee: [Data curation / corpus filtering] Data curation section: the manuscript provides no quantitative overlap statistics between the filtered 12B-token corpus and the three Portuguese retrieval test sets, nor any decontamination protocol. This is load-bearing for the central claim of highest nDCG@10, as leakage could artifactually inflate the reported gains.

Authors: We agree this is a substantive concern for validating the retrieval results. The original manuscript omitted explicit overlap statistics and decontamination details. Since the full pretraining corpus is publicly released, we will add a new subsection reporting quantitative overlap percentages with each of the three Portuguese retrieval test sets along with the decontamination protocol used during curation. revision: yes
Referee: [Corpus construction] Corpus filtering description: no details are given on the training data, architecture, or decision thresholds of the educational and STEM classifiers used to filter FineWeb2. Without these, it is impossible to assess whether the corpus is representative or whether filtering introduces unintended biases that affect downstream retrieval and PLUE-PT results.

Authors: We acknowledge that the manuscript lacks sufficient detail on the classifiers. In the revised version we will expand the corpus construction section to specify the training data, model architectures, and exact decision thresholds employed for both the educational and STEM classifiers applied to FineWeb2. revision: yes
Referee: [Ablation studies] Evaluation and ablation sections: the abstract and results claim continued pretraining is "strongly preferable" to from-scratch training, yet no hyperparameter budgets, data schedules, or statistical significance tests (with error bars) are reported for the from-scratch baseline, undermining the ablation's ability to support the headline performance claims.

Authors: The from-scratch baseline was trained under a matched token budget and similar hyperparameter schedule to the continued-pretraining runs, but these details and statistical tests were not reported. We will revise the ablation section to include the hyperparameter budgets, data schedules, error bars, and significance tests comparing the from-scratch and continued-pretraining models. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical results only

full rationale

The paper reports continued pretraining of ModernBERT on a curated Portuguese corpus followed by direct empirical evaluation on retrieval (nDCG@10), PLUE-PT, classification, NER, and NLU benchmarks. Ablations compare tokenizer variants, long-context post-training, and from-scratch training via measured performance differences. No equations, first-principles derivations, or fitted-parameter predictions appear in the provided text. Central claims rest on benchmark scores rather than any self-definitional reduction, self-citation chain, or ansatz smuggled via prior work. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review limits visibility into hyperparameters, data filtering details, and any unstated modeling choices; no explicit free parameters, axioms, or invented entities are described beyond standard continued pretraining assumptions.

pith-pipeline@v0.9.1-grok · 5818 in / 1131 out tokens · 28126 ms · 2026-06-26T10:05:53.403926+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 1 canonical work pages

[1]

arXiv preprint arXiv:2509.08824 (2025)

Almeida, T.S., Nogueira, R., Pedrini, H.: Building high-quality datasets for por- tuguese llms: From common crawl snapshots to industrial-grade corpora. arXiv preprint arXiv:2509.08824 (2025)

arXiv 2025
[2]

arXiv preprint arXiv:2512.12770 (2025)

Almeida, T.S., Nogueira, R., Pedrini, H.: Curi\’o-edu 7b: Examining data selection impacts in llm continued pretraining. arXiv preprint arXiv:2512.12770 (2025)

arXiv 2025
[3]

arXiv preprint arXiv:2411.08868 (2024)

Antoun, W., Kulumba, F., Touchent, R., de la Clergerie, É., Sagot, B., Seddah, D.: Camembert 2.0: A smarter french language model aged to perfection. arXiv preprint arXiv:2411.08868 (2024)

arXiv 2024
[4]

Boizard, N., Gisserot-Boukhlef, H., Alves, D.M., Martins, A., Hammal, A., Corro, C., Hudelot, C., Malherbe, E., Malaboeuf, E., Jourdan, F., Hautreux, G., Alves, J., El-Haddad, K., Faysse, M., Peyrard, M., Guerreiro, N.M., Fernandes, P., Rei, R., Colombo, P.: EuroBERT: Scaling multilingual encoders for european languages (2025)

2025
[5]

Bonifacio, L., Jeronymo, V., Abonizio, H.Q., Campiotti, I., Fadaee, M., Lotufo, R., Nogueira, R.: mmarco: A multilingual version of the ms marco passage ranking dataset (2022),https://arxiv.org/abs/2108.13897 Title Suppressed Due to Excessive Length 13

arXiv 2022
[6]

arXiv preprint arXiv:2502.19587 (2025)

Breton, L.L., Fournier, Q., Mezouar, M.E., Morris, J.X., Chandar, S.: Neobert: A next-generation bert. arXiv preprint arXiv:2502.19587 (2025)

arXiv 2025
[7]

Bueno, M., de Oliveira, E.S., Nogueira, R., Lotufo, R.A., Pereira, J.A.: Quati: A brazilian portuguese information retrieval dataset from native speakers (2024), https://arxiv.org/abs/2404.06976

arXiv 2024
[8]

Cesconetto, L.: NeoBERTugues: A portuguese ModernBERT model.https:// huggingface.co/lorenzocc/NeoBERTugues(2026)

2026
[9]

Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., Liu, Z.: Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self- knowledge distillation (2024)

2024
[10]

In: Proceedings of the 58th annual meeting of the association for computational linguistics

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., Stoyanov, V.: Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th annual meeting of the association for computational linguistics. pp. 8440–8451 (2020)

2020
[11]

In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers)

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi- rectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). pp. 4171–4186 (2019)

2019
[12]

In: Proceedings of the 16th International Conference on Computational Processing of Portuguese-Vol

Garcia, E., Silva, N., Siqueira, F., Gomes, J., Albuquerque, H.O., Souza, E., Lima, E., de Carvalho, A.: Robertalexpt: A legal roberta model pretrained with dedu- plication for portuguese. In: Proceedings of the 16th International Conference on Computational Processing of Portuguese-Vol. 1. pp. 374–383 (2024)

2024
[13]

GOMES, J.R.S.: Plue: Portuguese language understanding evaluation.https:// github.com/ju-resplande/PLUE(2020)

2020
[14]

arXiv preprint arXiv:2111.09543 (2021)

He, P., Gao, J., Chen, W.: Debertav3: Improving deberta using electra- style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543 (2021)

Pith/arXiv arXiv 2021
[15]

Jeronymo,V.,Nascimento,M.,Lotufo,R.,Nogueira,R.:mrobust04:Amultilingual version of the trec robust 2004 benchmark (2022),https://arxiv.org/abs/2209. 13738

2004
[16]

Le Breton, L., Fournier, Q., El Mezouar, M., Morris, J.X., Chandar, S.: Neobert: A next-generation BERT (2025)

2025
[17]

arXiv preprint arXiv:1907.11692 (2019)

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

Pith/arXiv arXiv 1907
[18]

In: International Conference on the Computational Processing of Por- tuguese (PROPOR)

Luz de Araujo, P.H., de Campos, T.E., de Oliveira, R.R.R., Stauffer, M., Couto, S., Bermejo, P.: LeNER-Br: a dataset for named entity recognition in Brazilian legal text. In: International Conference on the Computational Processing of Por- tuguese (PROPOR). pp. 313–323. Lecture Notes on Computer Science (LNCS), Springer, Canela, RS, Brazil (September 24-2...

2018
[19]

Marone, M., Weller, O., Fleshman, W., Yang, E., Lawrie, D., Van Durme, B.: mmbert: A modern multilingual encoder with annealed language learning (2025)

2025
[20]

In: Proceedings of the 58th annual meeting of the association for computational linguistics

Martin, L., Muller, B., Suarez, P.O., Dupont, Y., Romary, L., de La Clergerie, É.V., Seddah, D., Sagot, B.: Camembert: a tasty french language model. In: Proceedings of the 58th annual meeting of the association for computational linguistics. pp. 7203–7219 (2020)

2020
[21]

Semina: Ciências Exatas 14 T

Mazza Zago, R., Agnoletti dos Santos Pedotti, L.: BERTugues: A novel BERT transformer model pre-trained for brazilian portuguese. Semina: Ciências Exatas 14 T. Laitz et al. e Tecnológicas45, e50630 (2024).https://doi.org/10.5433/1679-0375.2024. v45.50630

work page doi:10.5433/1679-0375.2024 2024
[22]

Muennighoff,N.,Tazi,N.,Magne,L.,Reimers,N.:MTEB:Massivetextembedding benchmark (2022)

2022
[23]

Nguyen, D.Q., Nguyen, A.T.: Phobert: Pre-trained language models for viet- namese.In:Findingsoftheassociationforcomputationallinguistics:EMNLP2020. pp. 1037–1042 (2020)

2020
[24]

arXiv preprint arXiv:2506.20920 (2025)

Penedo,G.,Kydlíček,H.,Sabolčec,V.,Messmer,B.,Foroutan,N.,Kargaran,A.H., Raffel, C., Jaggi, M., Von Werra, L., Wolf, T.: Fineweb2: One pipeline to scale them all–adapting pre-training data processing to every language. arXiv preprint arXiv:2506.20920 (2025)

arXiv 2025
[25]

In: EPIA Conference on Artificial Intelligence

Rodrigues, J., Gomes, L., Silva, J., Branco, A., Santos, R., Cardoso, H.L., Osório, T.: Advancing neural encoding of portuguese with transformer albertina pt. In: EPIA Conference on Artificial Intelligence. pp. 441–453. Springer (2023)

2023
[26]

In: Melero, M., Sakti, S., Soria, C

Santos, R., Rodrigues, J., Gomes, L., Silva, J.R., Branco, A., Lopes Cardoso, H., Osório, T.F., Leite, B.: Fostering the ecosystem of open neural encoders for Portuguese with albertina PT* family. In: Melero, M., Sakti, S., Soria, C. (eds.) Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under- resourced Languages @ LREC-COLING 2024...

2024
[27]

In: Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Mod- els

Scheible-Schmitt, R., He, H., Mendes, A.B.: Portbert: Navigating the depths of portuguese language models. In: Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Mod- els. pp. 59–71 (2025)

2025
[28]

In: Proceedings of the 3rd clinical natural language processing workshop

Schneider, E.T.R., de Souza, J.V.A., Knafou, J., e Oliveira, L.E.S., Copara, J., Gumiel, Y.B., de Oliveira, L.F.A., Paraiso, E.C., Teodoro, D., Barra, C.M.C.M.: Biobertpt-a portuguese neural language model for clinical named entity recogni- tion. In: Proceedings of the 3rd clinical natural language processing workshop. pp. 65–72 (2020)

2020
[29]

In: Brazilian Confer- ence on Intelligent Systems

Silva, M.O., Oliveira, G.P., Costa, L.G., Pappa, G.L.: Govbert-br: A bert-based language model for brazilian portuguese governmental data. In: Brazilian Confer- ence on Intelligent Systems. pp. 19–32. Springer (2024)

2024
[30]

In: Brazilian conference on intelligent systems

Souza,F.,Nogueira,R.,Lotufo,R.:Bertimbau:pretrainedbertmodelsforbrazilian portuguese. In: Brazilian conference on intelligent systems. pp. 403–417. Springer (2020)

2020
[31]

Sugiura, I., Nakayama, K., Oda, Y.: llm-jp-modernbert: A ModernBERT model trained on a large-scale japanese corpus with long context length (2025)

2025
[32]

In: Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018) (2018)

Wagner Filho, J.A., Wilkens, R., Idiart, M., Villavicencio, A.: The brwac corpus: A new open resource for brazilian portuguese. In: Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018) (2018)

2018
[33]

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: Glue: A multi- task benchmark and analysis platform for natural language understanding (2019), https://arxiv.org/abs/1804.07461

Pith/arXiv arXiv 2019
[34]

In: Proceedings of the 63rd Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Long Papers)

Warner, B., Chaffin, A., Clavié, B., Weller, O., Hallström, O., Taghadouini, S., Gallagher, A., Biswas, R., Ladhak, F., Aarsen, T., et al.: Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. In: Proceedings of the 63rd Annual Meeting of the As- sociation for Computational ...

2025
[35]

Weller, O., Ricci, K., Marone, M., Chaffin, A., Lawrie, D., Van Durme, B.: Seq vs seq: An open suite of paired encoders and decoders (2025) Title Suppressed Due to Excessive Length 15

2025
[36]

In: Anais do XXII Encontro Nacional de Inteligência Artificial e Computa- cional.pp.2044–2055.SBC,PortoAlegre,RS,Brasil(2025).https://doi.org/10

Wu, W., Garcia, L.: Modbertbr: A modernbert-based model for brazilian por- tuguese. In: Anais do XXII Encontro Nacional de Inteligência Artificial e Computa- cional.pp.2044–2055.SBC,PortoAlegre,RS,Brasil(2025).https://doi.org/10. 5753/eniac.2025.14516,https://sol.sbc.org.br/index.php/eniac/article/ view/38875

arXiv 2044
[37]

Wunderle, J., Ehrmanntraut, A., Pfister, J., Jannidis, F., Hotho, A.: New en- codersforgermantrainedfromscratch:ComparingModernGBERTwithconverted LLM2Vec models (2025)

2025

[1] [1]

arXiv preprint arXiv:2509.08824 (2025)

Almeida, T.S., Nogueira, R., Pedrini, H.: Building high-quality datasets for por- tuguese llms: From common crawl snapshots to industrial-grade corpora. arXiv preprint arXiv:2509.08824 (2025)

arXiv 2025

[2] [2]

arXiv preprint arXiv:2512.12770 (2025)

Almeida, T.S., Nogueira, R., Pedrini, H.: Curi\’o-edu 7b: Examining data selection impacts in llm continued pretraining. arXiv preprint arXiv:2512.12770 (2025)

arXiv 2025

[3] [3]

arXiv preprint arXiv:2411.08868 (2024)

Antoun, W., Kulumba, F., Touchent, R., de la Clergerie, É., Sagot, B., Seddah, D.: Camembert 2.0: A smarter french language model aged to perfection. arXiv preprint arXiv:2411.08868 (2024)

arXiv 2024

[4] [4]

Boizard, N., Gisserot-Boukhlef, H., Alves, D.M., Martins, A., Hammal, A., Corro, C., Hudelot, C., Malherbe, E., Malaboeuf, E., Jourdan, F., Hautreux, G., Alves, J., El-Haddad, K., Faysse, M., Peyrard, M., Guerreiro, N.M., Fernandes, P., Rei, R., Colombo, P.: EuroBERT: Scaling multilingual encoders for european languages (2025)

2025

[5] [5]

Bonifacio, L., Jeronymo, V., Abonizio, H.Q., Campiotti, I., Fadaee, M., Lotufo, R., Nogueira, R.: mmarco: A multilingual version of the ms marco passage ranking dataset (2022),https://arxiv.org/abs/2108.13897 Title Suppressed Due to Excessive Length 13

arXiv 2022

[6] [6]

arXiv preprint arXiv:2502.19587 (2025)

Breton, L.L., Fournier, Q., Mezouar, M.E., Morris, J.X., Chandar, S.: Neobert: A next-generation bert. arXiv preprint arXiv:2502.19587 (2025)

arXiv 2025

[7] [7]

Bueno, M., de Oliveira, E.S., Nogueira, R., Lotufo, R.A., Pereira, J.A.: Quati: A brazilian portuguese information retrieval dataset from native speakers (2024), https://arxiv.org/abs/2404.06976

arXiv 2024

[8] [8]

Cesconetto, L.: NeoBERTugues: A portuguese ModernBERT model.https:// huggingface.co/lorenzocc/NeoBERTugues(2026)

2026

[9] [9]

Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., Liu, Z.: Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self- knowledge distillation (2024)

2024

[10] [10]

In: Proceedings of the 58th annual meeting of the association for computational linguistics

Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F., Grave, E., Ott, M., Zettlemoyer, L., Stoyanov, V.: Unsupervised cross-lingual representation learning at scale. In: Proceedings of the 58th annual meeting of the association for computational linguistics. pp. 8440–8451 (2020)

2020

[11] [11]

In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers)

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi- rectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). pp. 4171–4186 (2019)

2019

[12] [12]

In: Proceedings of the 16th International Conference on Computational Processing of Portuguese-Vol

Garcia, E., Silva, N., Siqueira, F., Gomes, J., Albuquerque, H.O., Souza, E., Lima, E., de Carvalho, A.: Robertalexpt: A legal roberta model pretrained with dedu- plication for portuguese. In: Proceedings of the 16th International Conference on Computational Processing of Portuguese-Vol. 1. pp. 374–383 (2024)

2024

[13] [13]

GOMES, J.R.S.: Plue: Portuguese language understanding evaluation.https:// github.com/ju-resplande/PLUE(2020)

2020

[14] [14]

arXiv preprint arXiv:2111.09543 (2021)

He, P., Gao, J., Chen, W.: Debertav3: Improving deberta using electra- style pre-training with gradient-disentangled embedding sharing. arXiv preprint arXiv:2111.09543 (2021)

Pith/arXiv arXiv 2021

[15] [15]

Jeronymo,V.,Nascimento,M.,Lotufo,R.,Nogueira,R.:mrobust04:Amultilingual version of the trec robust 2004 benchmark (2022),https://arxiv.org/abs/2209. 13738

2004

[16] [16]

Le Breton, L., Fournier, Q., El Mezouar, M., Morris, J.X., Chandar, S.: Neobert: A next-generation BERT (2025)

2025

[17] [17]

arXiv preprint arXiv:1907.11692 (2019)

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

Pith/arXiv arXiv 1907

[18] [18]

In: International Conference on the Computational Processing of Por- tuguese (PROPOR)

Luz de Araujo, P.H., de Campos, T.E., de Oliveira, R.R.R., Stauffer, M., Couto, S., Bermejo, P.: LeNER-Br: a dataset for named entity recognition in Brazilian legal text. In: International Conference on the Computational Processing of Por- tuguese (PROPOR). pp. 313–323. Lecture Notes on Computer Science (LNCS), Springer, Canela, RS, Brazil (September 24-2...

2018

[19] [19]

Marone, M., Weller, O., Fleshman, W., Yang, E., Lawrie, D., Van Durme, B.: mmbert: A modern multilingual encoder with annealed language learning (2025)

2025

[20] [20]

In: Proceedings of the 58th annual meeting of the association for computational linguistics

Martin, L., Muller, B., Suarez, P.O., Dupont, Y., Romary, L., de La Clergerie, É.V., Seddah, D., Sagot, B.: Camembert: a tasty french language model. In: Proceedings of the 58th annual meeting of the association for computational linguistics. pp. 7203–7219 (2020)

2020

[21] [21]

Semina: Ciências Exatas 14 T

Mazza Zago, R., Agnoletti dos Santos Pedotti, L.: BERTugues: A novel BERT transformer model pre-trained for brazilian portuguese. Semina: Ciências Exatas 14 T. Laitz et al. e Tecnológicas45, e50630 (2024).https://doi.org/10.5433/1679-0375.2024. v45.50630

work page doi:10.5433/1679-0375.2024 2024

[22] [22]

Muennighoff,N.,Tazi,N.,Magne,L.,Reimers,N.:MTEB:Massivetextembedding benchmark (2022)

2022

[23] [23]

Nguyen, D.Q., Nguyen, A.T.: Phobert: Pre-trained language models for viet- namese.In:Findingsoftheassociationforcomputationallinguistics:EMNLP2020. pp. 1037–1042 (2020)

2020

[24] [24]

arXiv preprint arXiv:2506.20920 (2025)

Penedo,G.,Kydlíček,H.,Sabolčec,V.,Messmer,B.,Foroutan,N.,Kargaran,A.H., Raffel, C., Jaggi, M., Von Werra, L., Wolf, T.: Fineweb2: One pipeline to scale them all–adapting pre-training data processing to every language. arXiv preprint arXiv:2506.20920 (2025)

arXiv 2025

[25] [25]

In: EPIA Conference on Artificial Intelligence

Rodrigues, J., Gomes, L., Silva, J., Branco, A., Santos, R., Cardoso, H.L., Osório, T.: Advancing neural encoding of portuguese with transformer albertina pt. In: EPIA Conference on Artificial Intelligence. pp. 441–453. Springer (2023)

2023

[26] [26]

In: Melero, M., Sakti, S., Soria, C

Santos, R., Rodrigues, J., Gomes, L., Silva, J.R., Branco, A., Lopes Cardoso, H., Osório, T.F., Leite, B.: Fostering the ecosystem of open neural encoders for Portuguese with albertina PT* family. In: Melero, M., Sakti, S., Soria, C. (eds.) Proceedings of the 3rd Annual Meeting of the Special Interest Group on Under- resourced Languages @ LREC-COLING 2024...

2024

[27] [27]

In: Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Mod- els

Scheible-Schmitt, R., He, H., Mendes, A.B.: Portbert: Navigating the depths of portuguese language models. In: Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Mod- els. pp. 59–71 (2025)

2025

[28] [28]

In: Proceedings of the 3rd clinical natural language processing workshop

Schneider, E.T.R., de Souza, J.V.A., Knafou, J., e Oliveira, L.E.S., Copara, J., Gumiel, Y.B., de Oliveira, L.F.A., Paraiso, E.C., Teodoro, D., Barra, C.M.C.M.: Biobertpt-a portuguese neural language model for clinical named entity recogni- tion. In: Proceedings of the 3rd clinical natural language processing workshop. pp. 65–72 (2020)

2020

[29] [29]

In: Brazilian Confer- ence on Intelligent Systems

Silva, M.O., Oliveira, G.P., Costa, L.G., Pappa, G.L.: Govbert-br: A bert-based language model for brazilian portuguese governmental data. In: Brazilian Confer- ence on Intelligent Systems. pp. 19–32. Springer (2024)

2024

[30] [30]

In: Brazilian conference on intelligent systems

Souza,F.,Nogueira,R.,Lotufo,R.:Bertimbau:pretrainedbertmodelsforbrazilian portuguese. In: Brazilian conference on intelligent systems. pp. 403–417. Springer (2020)

2020

[31] [31]

Sugiura, I., Nakayama, K., Oda, Y.: llm-jp-modernbert: A ModernBERT model trained on a large-scale japanese corpus with long context length (2025)

2025

[32] [32]

In: Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018) (2018)

Wagner Filho, J.A., Wilkens, R., Idiart, M., Villavicencio, A.: The brwac corpus: A new open resource for brazilian portuguese. In: Proceedings of the eleventh international conference on language resources and evaluation (LREC 2018) (2018)

2018

[33] [33]

Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: Glue: A multi- task benchmark and analysis platform for natural language understanding (2019), https://arxiv.org/abs/1804.07461

Pith/arXiv arXiv 2019

[34] [34]

In: Proceedings of the 63rd Annual Meeting of the As- sociation for Computational Linguistics (Volume 1: Long Papers)

Warner, B., Chaffin, A., Clavié, B., Weller, O., Hallström, O., Taghadouini, S., Gallagher, A., Biswas, R., Ladhak, F., Aarsen, T., et al.: Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. In: Proceedings of the 63rd Annual Meeting of the As- sociation for Computational ...

2025

[35] [35]

Weller, O., Ricci, K., Marone, M., Chaffin, A., Lawrie, D., Van Durme, B.: Seq vs seq: An open suite of paired encoders and decoders (2025) Title Suppressed Due to Excessive Length 15

2025

[36] [36]

In: Anais do XXII Encontro Nacional de Inteligência Artificial e Computa- cional.pp.2044–2055.SBC,PortoAlegre,RS,Brasil(2025).https://doi.org/10

Wu, W., Garcia, L.: Modbertbr: A modernbert-based model for brazilian por- tuguese. In: Anais do XXII Encontro Nacional de Inteligência Artificial e Computa- cional.pp.2044–2055.SBC,PortoAlegre,RS,Brasil(2025).https://doi.org/10. 5753/eniac.2025.14516,https://sol.sbc.org.br/index.php/eniac/article/ view/38875

arXiv 2044

[37] [37]

Wunderle, J., Ehrmanntraut, A., Pfister, J., Jannidis, F., Hotho, A.: New en- codersforgermantrainedfromscratch:ComparingModernGBERTwithconverted LLM2Vec models (2025)

2025