arxiv: 2507.00994 · v4 · submitted 2025-07-01 · 💻 cs.CL

Should We Still Pretrain Encoders with Masked Language Modeling?

Pith reviewed 2026-05-19 06:28 UTC · model grok-4.3

classification 💻 cs.CL
keywords: encoder pretraining · masked language modeling · causal language modeling · biphasic training · text representations · NLP benchmarks · pretraining objectives

The pith

A biphasic pretraining approach that applies causal language modeling first and masked language modeling second produces stronger encoders than either objective alone under the same compute budget.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether masked language modeling remains the best way to pretrain text encoders or if recent decoder models pretrained with causal language modeling offer a real advantage. Through large-scale controlled experiments training dozens of models from 210 million to 1 billion parameters, the authors find that masked language modeling generally delivers higher final performance on representation benchmarks. However, causal language modeling proves more data-efficient and produces models that fine-tune more stably. The key result is that a two-phase schedule starting with causal language modeling and switching to masked language modeling reaches the best overall quality when total training tokens are held fixed. This two-phase route also becomes especially practical when one can start from an already-trained causal language model, cutting the extra cost needed to reach top encoder performance.
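To make the two objectives concrete, here is a minimal sketch of how training targets differ between them. It is an illustration, not the paper's data pipeline: `MASK_ID`, `IGNORE`, and both helper functions are hypothetical names, and the masking is simplified (every selected token is replaced by the mask id, with no random or kept-token replacement).

```python
import random

MASK_ID = 0      # hypothetical id reserved for the mask token
IGNORE = -100    # hypothetical label value meaning "no loss at this position"

def clm_example(token_ids):
    """Causal LM: each position predicts the next token from left context only."""
    inputs = token_ids[:-1]
    labels = token_ids[1:]
    return inputs, labels

def mlm_example(token_ids, mask_ratio, rng):
    """Masked LM: hide a random subset of tokens and predict only those."""
    n_mask = max(1, round(mask_ratio * len(token_ids)))
    masked = set(rng.sample(range(len(token_ids)), n_mask))
    inputs = [MASK_ID if i in masked else t for i, t in enumerate(token_ids)]
    labels = [t if i in masked else IGNORE for i, t in enumerate(token_ids)]
    return inputs, labels

tokens = [11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
clm_inputs, clm_labels = clm_example(tokens)
mlm_inputs, mlm_labels = mlm_example(tokens, mask_ratio=0.4, rng=random.Random(0))
```

In this sketch the causal objective produces a loss term at every position, while the masked objective produces one only at the masked positions. That difference in training signal per sequence is one plausible reading of the data-efficiency gap the summary describes, though the page itself does not state the mechanism.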

Core claim

While masked language modeling alone produces stronger final encoders than causal language modeling alone, a sequential schedule that first trains with causal language modeling and then switches to masked language modeling yields the highest performance across text representation tasks when the total number of training tokens is fixed. Models trained this way also inherit the data efficiency and fine-tuning stability advantages of causal language modeling, and the advantage grows when the first phase begins from an existing pretrained causal language model rather than random initialization.

What carries the argument

Biphasic training schedule that applies causal language modeling followed by masked language modeling on the same data and model size.
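The schedule can be sketched as a step-indexed switch under a fixed budget. This is a hedged illustration only: `objective_at_step`, `split_budget`, and the `clm_fraction` parameter are hypothetical, and the actual switch point used in the paper is not given on this page.

```python
def switch_step(total_steps, clm_fraction):
    """First step at which training uses MLM instead of CLM."""
    if not 0.0 <= clm_fraction <= 1.0:
        raise ValueError("clm_fraction must be in [0, 1]")
    return int(total_steps * clm_fraction)

def objective_at_step(step, total_steps, clm_fraction):
    """Objective for a given step: CLM before the switch, MLM from it onward."""
    return "clm" if step < switch_step(total_steps, clm_fraction) else "mlm"

def split_budget(total_steps, clm_fraction):
    """Steps spent on each objective; the two always sum to total_steps."""
    s = switch_step(total_steps, clm_fraction)
    return {"clm": s, "mlm": total_steps - s}

# Illustrative values only (assumptions, not the paper's settings).
budget = split_budget(total_steps=1000, clm_fraction=0.25)
```

Setting `clm_fraction` to 0 or 1 recovers the two single-objective baselines, so all three conditions share the same total step count by construction.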

If this is right

  • Under a fixed token budget, encoders reach higher quality when the pretraining objective changes from causal to masked midway through training.
  • Starting the first phase from an already-trained causal language model reduces the additional tokens needed to match or exceed pure masked language modeling encoders.
  • Causal language modeling pretraining produces representations that are more stable during subsequent fine-tuning on downstream tasks.
  • Data efficiency gains from causal language modeling persist even when the model later switches to masked language modeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the biphasic schedule generalizes, many existing causal language models could be cheaply converted into strong encoders by adding a shorter masked language modeling phase rather than training from scratch.
  • The stability advantage of causal language modeling may reduce the need for extensive hyperparameter tuning when adapting encoders to new tasks.
  • Future encoder work could explore whether other objective switches, such as adding denoising or contrastive phases, produce similar compounding gains.

Load-bearing premise

The controlled ablations isolate the pretraining objective from differences in data ordering, optimizer settings, or evaluation choices.

What would settle it

Retrain the same model sizes on the same data with the order of the two phases swapped (masked language modeling first, then causal language modeling), and check whether final benchmark scores fall below the reported CLM-then-MLM results.

Figures

Figures reproduced from arXiv: 2507.00994 by André F. T. Martins, Céline Hudelot, Duarte M. Alves, Emmanuel Malherbe, Hippolyte Gisserot-Boukhlef, Manuel Faysse, Nicolas Boizard, Pierre Colombo.

Figure 1: Experimental setup overview and key results on sequence classification (610M model size, 40% …
Figure 2: MLM vs. CLM downstream performance, averaged across tasks and reported for all model sizes.
Figure 3: Task-wise downstream performance across different masking ratios for all model sizes.
Figure 4: Downstream performance as a function of pretraining steps for CLM and MLM objectives. Results …
Figure 5: Impact of the fine-tuning learning rate on MLM- vs. CLM-pretrained models. Error bars indicate …
Figure 6: Impact of two-stage CLM+MLM pretraining on downstream performance under different training …
Figure 7: Comparison of downstream performance variability across different masking ratios (20%, 30%, …
Figure 8: Impact of performing MLM CPT on either CLM- or MLM-pretrained models (denoted as Base).
Figure 9: MLM loss curves for CLM- and MLM-pretrained models across the 3 CPT compute budgets (2,000, …
Figure 10: Downstream performance as a function of CPT length for CLM- and MLM-pretrained models, …

Abstract

Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 38 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM, achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models, reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at https://hf.co/MLMvsCLM to foster further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper conducts large-scale controlled ablations (38 models, 210M–1B parameters, >15k fine-tuning runs) comparing MLM and CLM objectives for encoder pretraining. It reports that MLM generally outperforms CLM on representation benchmarks, while CLM is more data-efficient and stable during fine-tuning; a biphasic CLM-then-MLM schedule under fixed token budget yields the best results, with further gains when initializing from public CLM checkpoints.

Significance. If the empirical findings hold, the work offers a practical recipe for training high-quality encoders more efficiently by leveraging existing CLM models and a simple biphasic schedule. The scale of the controlled experiments and public release of all artifacts constitute clear strengths that could influence pretraining practice.

major comments (1)
  1. §4.2 (Ablation controls): The central claim that objective choice is isolated from confounders rests on the assertion of identical data, model sizes, and total tokens. Explicit confirmation is needed that learning-rate schedules, optimizer states, and data-ordering were matched exactly across MLM and CLM runs; any residual mismatch would undermine the data-efficiency and biphasic-superiority conclusions.
minor comments (3)
  1. Table 3: Report standard deviations or results from at least three random seeds for the key biphasic vs. single-objective comparisons so readers can assess whether the reported gains are robust.
  2. §5.1: The transition point in the biphasic schedule (number of CLM tokens before switching to MLM) is described narratively; adding an equation or pseudocode would make the exact protocol reproducible.
  3. Figure 2: Axis labels and legend text are small; increasing font size would improve readability of the scaling curves.
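
The seed-level reporting requested in the first minor comment could be computed along these lines. This is a sketch under stated assumptions: `summarize` and `mean_gain` are hypothetical helpers, and the scores below are placeholders chosen for illustration, not values from the paper.

```python
from statistics import mean, stdev

def summarize(scores):
    """Mean and sample standard deviation over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate spread")
    return mean(scores), stdev(scores)

def mean_gain(baseline_scores, candidate_scores):
    """Difference in mean score, candidate minus baseline."""
    return mean(candidate_scores) - mean(baseline_scores)

# Placeholder per-seed scores (NOT from the paper), three seeds per condition.
placeholder_single = [1.0, 2.0, 3.0]
placeholder_biphasic = [2.0, 3.0, 4.0]

single_mean, single_std = summarize(placeholder_single)
biphasic_mean, biphasic_std = summarize(placeholder_biphasic)
gain = mean_gain(placeholder_single, placeholder_biphasic)
```

Reporting the gain alongside both standard deviations lets a reader judge whether the difference is large relative to seed-to-seed variation.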

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive recommendation of minor revision and for recognizing the value of our large-scale controlled experiments. We address the single major comment below and will revise the manuscript accordingly to strengthen the presentation of our ablation controls.

Point-by-point responses
  1. Referee: §4.2 (Ablation controls): The central claim that objective choice is isolated from confounders rests on the assertion of identical data, model sizes, and total tokens. Explicit confirmation is needed that learning-rate schedules, optimizer states, and data-ordering were matched exactly across MLM and CLM runs; any residual mismatch would undermine the data-efficiency and biphasic-superiority conclusions.

    Authors: We agree that explicit confirmation strengthens the validity of our claims. All MLM and CLM runs were conducted with identical hyperparameters: the same learning-rate schedule (linear warmup followed by cosine decay with matching peak LR, warmup steps, and total steps), the same AdamW optimizer (identical betas, epsilon, and weight decay), and the same data ordering (identical shuffling seed and data loader configuration). These controls are already described in the experimental setup, but we will add a dedicated paragraph in §4.2 explicitly stating that these factors were matched exactly across objectives. This revision will not alter any results or conclusions.

    Revision: yes
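
The learning-rate shape named in the simulated rebuttal (linear warmup followed by cosine decay) can be sketched as a pure function of the step. The function name, the `min_lr` floor, and all numeric values are assumptions for illustration; the page does not give the peak LR, warmup length, or total steps.

```python
import math

def lr_at_step(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Illustrative values only (assumptions, not the paper's hyperparameters).
schedule = [lr_at_step(s, peak_lr=1e-3, warmup_steps=100, total_steps=1000)
            for s in range(1001)]
```

Because the rate depends only on the step index and three shared constants, reusing the same function and arguments for the MLM and CLM runs is one simple way to guarantee that the schedules match exactly.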

Circularity Check

0 steps flagged

No significant circularity

The paper reports purely empirical results from controlled pretraining ablations on 38 models (210M–1B parameters) with matched data, token budgets, and evaluation protocols. The biphasic CLM-then-MLM strategy is presented as an experimental outcome measured on held-out tasks rather than derived from equations or first-principles arguments. No load-bearing derivations, fitted-parameter predictions, or self-citation chains appear; all claims remain directly falsifiable by replicating the reported training runs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is purely empirical. No new mathematical axioms or invented entities are introduced. The only background assumptions are standard supervised fine-tuning practices and the representativeness of the chosen downstream benchmarks.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Causal Language Modeling Detour Improves Encoder Continued Pretraining

    cs.CL 2026-05 conditional novelty 7.0

    A temporary CLM phase followed by MLM decay during encoder continued pretraining outperforms standard MLM on biomedical tasks by 0.3-2.8pp across languages and model sizes.
