The MultiBERTs: BERT Reproductions for Robustness Analysis

Alexander D'Amour; Dipanjan Das; Ellie Pavlick; Ian Tenney; Iulia Turc; Jacob Eisenstein; Jasmijn Bastings; Jason Wei; Naomi Saphra; Steve Yadlowsky

arxiv: 2106.16163 · v2 · pith:FDFGYQPNnew · submitted 2021-06-30 · 💻 cs.CL

Thibault Sellam, Steve Yadlowsky, Jason Wei, Naomi Saphra, Alexander D'Amour, Tal Linzen, Jasmijn Bastings, Iulia Turc, Jacob Eisenstein, Dipanjan Das, Ian Tenney, Ellie Pavlick

classification 💻 cs.CL

keywords bert, data, models, checkpoint, checkpoints, conclusions, initialization, model

Experiments with pre-trained models such as BERT are often based on a single checkpoint. While the conclusions drawn apply to the artifact tested in the experiment (i.e., the particular instance of the model), it is not always clear whether they hold for the more general procedure, which includes the architecture, training data, initialization scheme, and loss function. Recent work has shown that repeating the pre-training process can lead to substantially different performance, suggesting that an alternative strategy is needed to make principled statements about procedures. To enable researchers to draw more robust conclusions, we introduce the MultiBERTs, a set of 25 BERT-Base checkpoints, trained with hyper-parameters similar to those of the original BERT model but differing in random weight initialization and shuffling of training data. We also define the Multi-Bootstrap, a non-parametric bootstrap method for statistical inference designed for settings where there are multiple pre-trained models and limited test data. To illustrate our approach, we present a case study of gender bias in coreference resolution, in which the Multi-Bootstrap lets us measure effects that may not be detected with a single checkpoint. We release our models and statistical library along with an additional set of 140 intermediate checkpoints captured during pre-training to facilitate research on learning dynamics.
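
The idea of bootstrapping when there are several pre-training seeds and a limited test set can be illustrated with a minimal sketch. This is not the released statistical library or its API. It assumes a hypothetical score matrix with one row per pre-training seed and one column per test example, and it resamples both axes with replacement to estimate the uncertainty of the mean score. The function names `multi_bootstrap_sketch` and `summarize` are made up for illustration.

```python
import numpy as np


def multi_bootstrap_sketch(scores, n_boot=1000, rng_seed=0):
    """Resample seeds and test examples with replacement.

    scores: array of shape (n_seeds, n_examples), where scores[s, i] is a
    per-example metric (e.g. 0/1 correctness) for pre-training seed s.
    This layout is an assumption made for illustration.
    Returns an array of n_boot bootstrap estimates of the mean score.
    """
    scores = np.asarray(scores, dtype=float)
    rng = np.random.default_rng(rng_seed)
    n_seeds, n_examples = scores.shape
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        # Draw seeds and examples independently, each with replacement.
        seed_idx = rng.integers(0, n_seeds, size=n_seeds)
        example_idx = rng.integers(0, n_examples, size=n_examples)
        estimates[b] = scores[np.ix_(seed_idx, example_idx)].mean()
    return estimates


def summarize(estimates, alpha=0.05):
    """Return (mean, lower, upper) for a percentile interval."""
    lower, upper = np.quantile(estimates, [alpha / 2, 1 - alpha / 2])
    return float(estimates.mean()), float(lower), float(upper)


if __name__ == "__main__":
    # Synthetic data only: 25 seeds, 200 test examples, 0/1 correctness.
    demo_rng = np.random.default_rng(1)
    demo_scores = demo_rng.binomial(1, 0.8, size=(25, 200))
    print(summarize(multi_bootstrap_sketch(demo_scores)))
```

Resampling only the examples would capture test-set noise for one checkpoint; resampling the seeds as well lets the interval reflect variation across pre-training runs, which is the kind of procedure-level uncertainty the abstract describes.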

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Do Activation Monitors Survive Model Updates? Benchmarking, Predicting, and Repairing Activation-Monitor Staleness
cs.LG 2026-06 unverdicted novelty 8.0

Fine-tuning updates frequently render activation monitors for language model safety stale, while quantization does not; the degradation is predictable and can be repaired via label-free realignment.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
cs.CL 2023-04 accept novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.