Should We Still Pretrain Encoders with Masked Language Modeling?
Pith reviewed 2026-05-19 06:28 UTC · model grok-4.3
The pith
A biphasic pretraining approach that applies causal language modeling first and masked language modeling second produces stronger encoders than either objective alone under the same compute budget.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
While masked language modeling alone produces stronger final encoders than causal language modeling alone, a sequential schedule that first trains with causal language modeling and then switches to masked language modeling yields the highest performance across text representation tasks when the total number of training tokens is fixed. Models trained this way also inherit the data efficiency and fine-tuning stability advantages of causal language modeling, and the advantage grows when the first phase begins from an existing pretrained causal language model rather than random initialization.
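The two objectives compared in the claim differ in which tokens are hidden and which are predicted. The following is a minimal sketch of how a training example could be built for each, not the paper's implementation. `MASK_ID`, `IGNORE`, and `mask_ratio` are hypothetical names and values chosen for illustration. In practice CLM is typically paired with a causal attention mask and MLM with bidirectional attention, which this sketch does not model.

```python
import random

IGNORE = -100  # hypothetical marker meaning "no loss at this position"
MASK_ID = 0    # hypothetical id of the mask token


def clm_example(tokens):
    """Causal LM: the input at position i is trained to predict token i + 1."""
    inputs = tokens[:-1]
    labels = tokens[1:]
    return inputs, labels


def mlm_example(tokens, mask_ratio, rng):
    """Masked LM: hide a random subset of tokens and predict only those."""
    n_mask = max(1, round(mask_ratio * len(tokens)))
    masked = set(rng.sample(range(len(tokens)), n_mask))
    inputs = [MASK_ID if i in masked else t for i, t in enumerate(tokens)]
    labels = [t if i in masked else IGNORE for i, t in enumerate(tokens)]
    return inputs, labels


if __name__ == "__main__":
    toks = [5, 6, 7, 8]
    print(clm_example(toks))
    print(mlm_example(toks, mask_ratio=0.5, rng=random.Random(0)))
```

Every position contributes to the CLM loss, whereas only the masked positions contribute to the MLM loss.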
What carries the argument
A biphasic training schedule: causal language modeling first, then masked language modeling, on the same data and at the same model size.
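A biphasic schedule under a fixed token budget can be written down as a single switch point. The sketch below assumes the switch is parameterized by a fraction of the total budget; `clm_fraction`, `objective_at`, and `phase_token_counts` are hypothetical names, and the paper's actual split is not stated here.

```python
def phase_token_counts(total_tokens: int, clm_fraction: float) -> tuple[int, int]:
    """Split a fixed token budget into (CLM tokens, MLM tokens)."""
    if not 0.0 <= clm_fraction <= 1.0:
        raise ValueError("clm_fraction must lie in [0, 1]")
    clm_tokens = round(clm_fraction * total_tokens)
    return clm_tokens, total_tokens - clm_tokens


def objective_at(tokens_seen: int, total_tokens: int, clm_fraction: float) -> str:
    """Return the objective in force once `tokens_seen` tokens have been consumed."""
    if not 0 <= tokens_seen < total_tokens:
        raise ValueError("tokens_seen must lie in [0, total_tokens)")
    clm_tokens, _ = phase_token_counts(total_tokens, clm_fraction)
    return "clm" if tokens_seen < clm_tokens else "mlm"


if __name__ == "__main__":
    # Hypothetical budget of 100 tokens with the first quarter spent on CLM.
    print(phase_token_counts(100, 0.25))
    print([objective_at(t, 100, 0.25) for t in (0, 24, 25, 99)])
```

Setting `clm_fraction` to 0 or 1 recovers the pure MLM and pure CLM baselines, so the three regimes share one total budget.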
If this is right
- Under a fixed token budget, encoders reach higher quality when the pretraining objective changes from causal to masked midway through training.
- Starting the first phase from an already-trained causal language model reduces the additional tokens needed to match or exceed pure masked language modeling encoders.
- Causal language modeling pretraining produces representations that are more stable during subsequent fine-tuning on downstream tasks.
- Data efficiency gains from causal language modeling persist even when the model later switches to masked language modeling.
Where Pith is reading between the lines
- If the biphasic schedule generalizes, many existing causal language models could be cheaply converted into strong encoders by adding a shorter masked language modeling phase rather than training from scratch.
- The stability advantage of causal language modeling may reduce the need for extensive hyperparameter tuning when adapting encoders to new tasks.
- Future encoder work could explore whether other objective switches, such as adding denoising or contrastive phases, produce similar compounding gains.
Load-bearing premise
The controlled ablations isolate the pretraining objective from differences in data ordering, optimizer settings, or evaluation choices.
What would settle it
Retrain the same model sizes on the same data with the biphasic schedule but swap the order to masked language modeling first followed by causal language modeling and measure whether final benchmark scores drop below the reported CLM-then-MLM results.
Original abstract
Learning high-quality text representations is fundamental to a wide range of NLP tasks. While encoder pretraining has traditionally relied on Masked Language Modeling (MLM), recent evidence suggests that decoder models pretrained with Causal Language Modeling (CLM) can be effectively repurposed as encoders, often surpassing traditional encoders on text representation benchmarks. However, it remains unclear whether these gains reflect an inherent advantage of the CLM objective or arise from confounding factors such as model and data scale. In this paper, we address this question through a series of large-scale, carefully controlled pretraining ablations, training a total of 38 models ranging from 210 million to 1 billion parameters, and conducting over 15,000 fine-tuning and evaluation runs. We find that while training with MLM generally yields better performance across text representation tasks, CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability. Building on these findings, we experimentally show that a biphasic training strategy that sequentially applies CLM and then MLM, achieves optimal performance under a fixed computational training budget. Moreover, we demonstrate that this strategy becomes more appealing when initializing from readily available pretrained CLM models, reducing the computational burden needed to train best-in-class encoder models. We release all project artifacts at https://hf.co/MLMvsCLM to foster further research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts large-scale controlled ablations (38 models, 210M–1B parameters, >15k fine-tuning runs) comparing MLM and CLM objectives for encoder pretraining. It reports that MLM generally outperforms CLM on representation benchmarks, while CLM is more data-efficient and stable during fine-tuning; a biphasic CLM-then-MLM schedule under fixed token budget yields the best results, with further gains when initializing from public CLM checkpoints.
Significance. If the empirical findings hold, the work offers a practical recipe for training high-quality encoders more efficiently by leveraging existing CLM models and a simple biphasic schedule. The scale of the controlled experiments and public release of all artifacts constitute clear strengths that could influence pretraining practice.
major comments (1)
- §4.2 (Ablation controls): The central claim that objective choice is isolated from confounders rests on the assertion of identical data, model sizes, and total tokens. Explicit confirmation is needed that learning-rate schedules, optimizer states, and data ordering were matched exactly across MLM and CLM runs; any residual mismatch would undermine the data-efficiency and biphasic-superiority conclusions.
minor comments (3)
- Table 3: Report standard deviations, or results from at least three random seeds, for the key biphasic vs. single-objective comparisons so readers can assess whether the reported gains are robust.
- §5.1: The transition point in the biphasic schedule (number of CLM tokens before switching to MLM) is described narratively; adding an equation or pseudocode would make the exact protocol reproducible.
- Figure 2: Axis labels and legend text are small; increasing font size would improve readability of the scaling curves.
Simulated Author's Rebuttal
We thank the referee for their positive recommendation of minor revision and for recognizing the value of our large-scale controlled experiments. We address the single major comment below and will revise the manuscript accordingly to strengthen the presentation of our ablation controls.
- Referee: §4.2 (Ablation controls): The central claim that objective choice is isolated from confounders rests on the assertion of identical data, model sizes, and total tokens. Explicit confirmation is needed that learning-rate schedules, optimizer states, and data ordering were matched exactly across MLM and CLM runs; any residual mismatch would undermine the data-efficiency and biphasic-superiority conclusions.
  Authors: We agree that explicit confirmation strengthens the validity of our claims. All MLM and CLM runs were conducted with identical hyperparameters: the same learning-rate schedule (linear warmup followed by cosine decay with matching peak LR, warmup steps, and total steps), the same AdamW optimizer (identical betas, epsilon, and weight decay), and the same data ordering (identical shuffling seed and data loader configuration). These controls are already described in the experimental setup, but we will add a dedicated paragraph in §4.2 explicitly stating that these factors were matched exactly across objectives. This revision will not alter any results or conclusions.
  Revision: yes
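The matched controls named in the response (linear warmup, cosine decay, a fixed shuffling seed) can be illustrated with a short sketch. This shows the general technique under stated assumptions, not the authors' training code: `learning_rate`, `data_order`, `min_lr`, and all numeric values are hypothetical.

```python
import math
import random


def learning_rate(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def data_order(num_examples, seed):
    """Deterministic shuffle: the same seed always yields the same ordering."""
    order = list(range(num_examples))
    random.Random(seed).shuffle(order)
    return order


if __name__ == "__main__":
    # Hypothetical settings shared by an MLM run and a CLM run.
    shared = dict(peak_lr=1.0, warmup_steps=10, total_steps=110)
    print([round(learning_rate(s, **shared), 3) for s in (0, 5, 10, 60, 110)])
    print(data_order(5, seed=42) == data_order(5, seed=42))
```

Passing the same arguments to both functions for every run is one way to ensure that two runs differ only in the objective.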
Circularity Check
No significant circularity
The paper reports purely empirical results from controlled pretraining ablations on 38 models (210M–1B parameters) with matched data, token budgets, and evaluation protocols. The biphasic CLM-then-MLM strategy is presented as an experimental outcome measured on held-out tasks rather than derived from equations or first-principles arguments. No load-bearing derivations, fitted-parameter predictions, or self-citation chains appear; all claims remain directly falsifiable by replicating the reported training runs.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Paper passage: "a biphasic training strategy that sequentially applies CLM and then MLM achieves optimal performance under a fixed computational training budget"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Paper passage: "CLM-trained models are more data-efficient and demonstrate improved fine-tuning stability"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- A Causal Language Modeling Detour Improves Encoder Continued Pretraining: A temporary CLM phase followed by MLM decay during encoder continued pretraining outperforms standard MLM on biomedical tasks by 0.3-2.8pp across languages and model sizes.