How far can bias go? Tracing bias from pretraining data to alignment
Pith reviewed 2026-05-23 08:23 UTC · model grok-4.3
The pith
Biases present in pre-training data are amplified in LLM outputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Biases present in pre-training data are amplified in model outputs. Using zero-shot prompting and token co-occurrence analyses on the Dolma dataset and OLMo model, the study shows that gender-occupation stereotypes grow stronger in the model's responses. Instruction-tuning partially alleviates representational bias while still maintaining overall stereotypical gender associations, whereas changes in hyperparameters and prompt types have smaller effects.
What carries the argument
Correlation analysis between gender-occupation bias measured in the Dolma pre-training corpus and the same bias expressed in OLMo model outputs via zero-shot prompts and token co-occurrence statistics.
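A minimal sketch of what such a correlation analysis could look like follows. It is not the authors' pipeline: the gendered word lists, the window size, the score formula (male minus female co-occurrences over their sum), and the model-side numbers are all illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical gendered word lists; the paper's actual lists are not given here.
MALE = {"he", "him", "his", "man"}
FEMALE = {"she", "her", "hers", "woman"}

def cooccurrence_bias(sentences, occupations, window=5):
    """Return {occupation: score in [-1, 1]}; positive means more male co-occurrence."""
    male, female = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok not in occupations:
                continue
            # Tokens within `window` positions on either side of the occupation.
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            male[tok] += sum(t in MALE for t in context)
            female[tok] += sum(t in FEMALE for t in context)
    scores = {}
    for occ in occupations:
        total = male[occ] + female[occ]
        if total:  # skip occupations with no gendered context at all
            scores[occ] = (male[occ] - female[occ]) / total
    return scores

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy corpus standing in for the pre-training data.
corpus = [
    "the nurse said she was tired",
    "the nurse said she would help him",
    "the engineer said he was late",
    "the engineer fixed his car",
    "the teacher met him and her",
]
occupations = {"nurse", "engineer", "teacher"}
data_bias = cooccurrence_bias(corpus, occupations)

# Placeholder model-side scores, e.g. derived from zero-shot prompt completions.
model_bias = {"nurse": -0.8, "engineer": 1.0, "teacher": 0.1}

order = sorted(occupations)
r = pearson([data_bias[o] for o in order], [model_bias[o] for o in order])
```

A high `r` only says the two rankings agree. Whether the model-side scores are larger in magnitude is a separate question, and it depends on both metrics sharing a comparable scale.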
If this is right
- Instruction-tuning reduces some forms of representational bias but does not remove stereotypical gender-occupation links.
- Variation in prompt type or model hyperparameters has limited impact on the level of bias expressed.
- Mitigating bias at the pre-training data stage is necessary because downstream alignment only partially compensates.
- The same tracing method can be applied to other bias types and model families to check whether amplification is general.
Where Pith is reading between the lines
- If bias amplification occurs early, then filtering or re-weighting pre-training data could reduce downstream bias more effectively than post-hoc alignment techniques.
- The persistence of stereotypes after instruction-tuning raises the question of whether alignment objectives need explicit anti-stereotype terms.
- Extending the co-occurrence method to measure bias in intermediate model checkpoints could reveal at which training stage amplification begins.
Load-bearing premise
That zero-shot prompting and token co-occurrence analyses provide a faithful, unbiased measure of how pretraining bias manifests in model behavior without introducing their own artifacts or selection effects.
What would settle it
A direct count showing that the frequency or strength of gender-occupation stereotypes in OLMo outputs is equal to or lower than their frequency in the Dolma training data would falsify the amplification claim.
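One way to phrase this falsification test in code is sketched below. It assumes per-occupation bias scores on a shared signed scale for both the data and the model, and the rule that "amplified" means same direction with larger magnitude. That rule is an assumption here, not a definition taken from the paper, and all names are hypothetical.

```python
def amplification_report(data_bias, model_bias):
    """Compare signed bias scores per occupation present in both inputs."""
    rows = {}
    for occ in sorted(data_bias.keys() & model_bias.keys()):
        d, m = data_bias[occ], model_bias[occ]
        same_direction = d * m > 0
        rows[occ] = {
            "data": d,
            "model": m,
            # Amplified: the model leans the same way as the data, but more strongly.
            "amplified": same_direction and abs(m) > abs(d),
        }
    return rows

def share_amplified(rows):
    """Fraction of occupations whose bias is amplified in the model."""
    return sum(row["amplified"] for row in rows.values()) / len(rows)

# Placeholder scores for illustration only, not results from the paper.
data_bias = {"nurse": -0.3, "engineer": 0.4, "teacher": 0.1}
model_bias = {"nurse": -0.8, "engineer": 0.9, "teacher": 0.05}
rows = amplification_report(data_bias, model_bias)
print(share_amplified(rows))
```

If the share of amplified occupations were at or below what chance would produce, or the model scores were systematically no larger than the data scores, the amplification claim would be in trouble under this reading.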
Original abstract
As LLMs are increasingly integrated into user-facing applications, addressing biases that perpetuate societal inequalities is crucial. While much work has gone into measuring or mitigating biases in these models, fewer studies have investigated their origins. Therefore, this study examines the correlation between gender-occupation bias in pre-training data and their manifestation in LLMs, focusing on the Dolma dataset and the OLMo model. Using zero-shot prompting and token co-occurrence analyses, we explore how biases in training data influence model outputs. Our findings reveal that biases present in pre-training data are amplified in model outputs. The study also examines the effects of prompt types, hyperparameters, and instruction-tuning on bias expression, finding instruction-tuning partially alleviating representational bias while still maintaining overall stereotypical gender associations, whereas hyperparameters and prompting variation have a lesser effect on bias expression. Our research traces bias throughout the LLM development pipeline and underscores the importance of mitigating bias at the pretraining stage.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines the correlation between gender-occupation bias in the Dolma pre-training dataset and its manifestation in the OLMo model. Using token co-occurrence analysis on the data and zero-shot prompting on the model, it claims that pretraining biases are amplified in model outputs. It further studies the effects of prompt types, hyperparameters, and instruction-tuning, finding that instruction-tuning partially alleviates representational bias while preserving overall stereotypical gender associations.
Significance. If the empirical mappings hold, the work provides concrete evidence that bias transmission occurs from pretraining data to aligned model outputs in an open model-dataset pair, underscoring the need for mitigation at the data curation stage rather than solely post-alignment. The observational tracing through the pipeline is a strength for an empirical study in this area.
major comments (2)
- [Methods / Results (bias measurement)] The central amplification claim (abstract) rests on the assumption that token co-occurrence statistics in Dolma directly predict and are exceeded by bias measured via zero-shot prompts in OLMo. Without reported validation (e.g., comparison of co-occurrence to the model's actual next-token distributions on held-out data or ablation on prompt phrasing), differences may reflect metric artifacts or selection effects in occupation lists rather than faithful data-to-model transmission.
- [Results (instruction-tuning analysis)] The claim of instruction-tuning 'partially alleviating representational bias while still maintaining overall stereotypical gender associations' (abstract) requires effect-size reporting and controls: what is the magnitude of change in bias metrics pre- vs. post-tuning, and are results robust to multiple prompt templates or random seeds? The current description leaves open whether the partial alleviation is statistically reliable or driven by specific choices.
minor comments (2)
- [Methods] Clarify the exact definition of 'amplification' (e.g., ratio of bias scores, statistical test) and whether base-rate normalization was applied in the co-occurrence analysis.
- [Abstract / Introduction] The abstract states the study 'traces bias throughout the LLM development pipeline,' but the reported analyses focus on pretraining data and post-alignment outputs; specify whether intermediate stages (e.g., SFT data) were examined.
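To illustrate why the base-rate question in the first minor comment matters, here is a small sketch of one possible normalization: a log ratio of per-gender co-occurrence rates. The function name, the smoothing constant, and the counts are assumptions for illustration; the paper may define its metric differently.

```python
import math

def normalized_log_ratio(male_occ, female_occ, male_total, female_total, smoothing=0.5):
    """Log ratio of co-occurrence rates, each divided by its gender's base count.

    male_occ / female_occ: co-occurrences of the occupation with gendered tokens.
    male_total / female_total: total gendered tokens in the corpus (base rates).
    smoothing: added to the occupation counts to avoid log(0); pass 0.0 only
    when both occupation counts are known to be positive.
    """
    male_rate = (male_occ + smoothing) / male_total
    female_rate = (female_occ + smoothing) / female_total
    return math.log(male_rate / female_rate)

# Toy counts: the corpus mentions men three times as often as women overall.
raw_ratio = 30 / 10                                         # looks male-skewed
normalized = normalized_log_ratio(30, 10, 3000, 1000, 0.0)  # skew disappears
print(raw_ratio, normalized)
```

In this toy case the raw 3:1 ratio only reflects the corpus-wide imbalance. After normalization the occupation shows no association of its own, which is why an amplification claim depends on which of the two quantities is compared with the model outputs.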
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on methodological validation and statistical robustness. We address each major comment below and will incorporate revisions to strengthen the empirical claims.
- Referee: The central amplification claim (abstract) rests on the assumption that token co-occurrence statistics in Dolma directly predict and are exceeded by bias measured via zero-shot prompts in OLMo. Without reported validation (e.g., comparison of co-occurrence to the model's actual next-token distributions on held-out data or ablation on prompt phrasing), differences may reflect metric artifacts or selection effects in occupation lists rather than faithful data-to-model transmission.
  Authors: We agree that direct validation would better support the transmission interpretation. Our current approach follows common practice in the bias literature by treating co-occurrence as a data-level proxy, but we will add (1) a comparison of occupation-gender co-occurrence rates against OLMo’s next-token log-probabilities on a held-out Dolma subset and (2) an ablation across three prompt phrasings to quantify sensitivity. These additions will be reported in a new subsection of the Methods and Results.
  Revision: yes
- Referee: The claim of instruction-tuning 'partially alleviating representational bias while still maintaining overall stereotypical gender associations' (abstract) requires effect-size reporting and controls: what is the magnitude of change in bias metrics pre- vs. post-tuning, and are results robust to multiple prompt templates or random seeds? The current description leaves open whether the partial alleviation is statistically reliable or driven by specific choices.
  Authors: We will report Cohen’s d (or equivalent) for the pre- vs. post-instruction-tuning change in each bias metric. We will also rerun the instruction-tuning analysis across five distinct prompt templates and three random seeds, presenting means and standard deviations to demonstrate robustness. These results will be added to the instruction-tuning subsection.
  Revision: yes
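For reference, the effect size promised in the second response can be computed as sketched below. This uses the standard pooled-standard-deviation form of Cohen's d; the bias scores are invented placeholders standing in for one score per template-and-seed run, not values from the paper.

```python
import math
from statistics import mean, stdev

def cohens_d(pre, post):
    """Cohen's d with pooled sample standard deviation.

    A positive value means the bias metric is higher before tuning than after.
    """
    n1, n2 = len(pre), len(post)
    pooled_var = ((n1 - 1) * stdev(pre) ** 2 + (n2 - 1) * stdev(post) ** 2) / (n1 + n2 - 2)
    return (mean(pre) - mean(post)) / math.sqrt(pooled_var)

# Placeholder bias scores, one per (prompt template, seed) run.
pre_tuning = [0.62, 0.58, 0.65, 0.60, 0.61, 0.59]
post_tuning = [0.48, 0.52, 0.45, 0.50, 0.47, 0.51]

d = cohens_d(pre_tuning, post_tuning)
print(f"pre:  mean={mean(pre_tuning):.3f} sd={stdev(pre_tuning):.3f}")
print(f"post: mean={mean(post_tuning):.3f} sd={stdev(post_tuning):.3f}")
print(f"Cohen's d = {d:.2f}")
```

Because the same templates and seeds would be run before and after tuning, a paired effect size (the mean of the per-run differences divided by their standard deviation) may be the more appropriate "equivalent". The pooled form above treats the two sets of runs as independent samples.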
Circularity Check
No circularity: purely observational empirical measurements
The paper performs direct empirical measurements of gender-occupation bias via token co-occurrence statistics in the Dolma pretraining corpus and via zero-shot prompting outputs from the OLMo model, then compares the two. No mathematical derivations, fitted parameters renamed as predictions, self-citation load-bearing theorems, or ansatzes are present. All reported quantities are independent observations rather than quantities forced by construction from the inputs.