How far can bias go? Tracing bias from pretraining data to alignment

Abdullatif Köksal; Alina Leidinger; Anna Korhonen; Hinrich Schütze; Marion Thaler

arXiv: 2411.19240 · v2 · submitted 2024-11-28 · cs.CL

Marion Thaler, Abdullatif Köksal, Alina Leidinger, Anna Korhonen, Hinrich Schütze

Pith reviewed 2026-05-23 08:23 UTC · model grok-4.3

classification cs.CL

keywords bias amplification · pre-training data · gender bias · instruction tuning · LLM evaluation · Dolma dataset · OLMo model

The pith

Biases present in pre-training data are amplified in LLM outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper traces gender-occupation bias from the Dolma pre-training dataset through the OLMo model and its alignment stage. It measures how bias in the raw training data shows up in model behavior using zero-shot prompting and token co-occurrence counts. The central finding is that these biases become stronger in the model's generated outputs rather than staying at the level seen in the data. Instruction-tuning reduces some representational bias but leaves stereotypical gender-occupation associations largely intact. The work argues that bias mitigation must start at the pre-training stage because later steps only partially counteract the amplification.

Core claim

Biases present in pre-training data are amplified in model outputs. Using zero-shot prompting and token co-occurrence analyses on the Dolma dataset and OLMo model, the study shows that gender-occupation stereotypes grow stronger in the model's responses. Instruction-tuning partially alleviates representational bias while still maintaining overall stereotypical gender associations, whereas changes in hyperparameters and prompt types have smaller effects.

What carries the argument

Correlation analysis between gender-occupation bias measured in the Dolma pre-training corpus and the same bias expressed in OLMo model outputs via zero-shot prompts and token co-occurrence statistics.
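As a rough illustration of the data-side half of this comparison, the sketch below counts gendered tokens near occupation mentions in a toy corpus. The term lists, window size, whitespace tokenizer, and the function names `gender_cooccurrence` and `female_share` are assumptions made for this example; the paper's actual word lists and counting procedure on Dolma are not described on this page.

```python
from collections import Counter

# Assumed, deliberately small gendered term lists (not the paper's lists).
FEMALE_TERMS = {"she", "her", "hers", "woman", "women"}
MALE_TERMS = {"he", "him", "his", "man", "men"}


def gender_cooccurrence(docs, occupations, window=5):
    """Count gendered tokens within `window` tokens of each occupation mention."""
    counts = {occ: Counter() for occ in occupations}
    for doc in docs:
        tokens = doc.lower().split()  # naive tokenizer; punctuation is not handled
        for i, tok in enumerate(tokens):
            if tok not in counts:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for neighbor in tokens[lo:i] + tokens[i + 1:hi]:
                if neighbor in FEMALE_TERMS:
                    counts[tok]["female"] += 1
                elif neighbor in MALE_TERMS:
                    counts[tok]["male"] += 1
    return counts


def female_share(counter):
    """Share of gendered co-occurrences that are female; None if there are none."""
    total = counter["female"] + counter["male"]
    return counter["female"] / total if total else None


# Toy corpus invented for illustration only.
docs = [
    "The nurse said she would call her patient",
    "The engineer said he fixed his code",
    "A nurse and he talked",
]
counts = gender_cooccurrence(docs, ["nurse", "engineer"])
shares = {occ: female_share(c) for occ, c in counts.items()}
```

The same per-occupation share could then be computed from model generations, giving two numbers per occupation to correlate. A real analysis would need a proper tokenizer and much larger term lists.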

If this is right

Instruction-tuning reduces some forms of representational bias but does not remove stereotypical gender-occupation links.
Variation in prompt type or model hyperparameters has limited impact on the level of bias expressed.
Mitigating bias at the pre-training data stage is necessary because downstream alignment only partially compensates.
The same tracing method can be applied to other bias types and model families to check whether amplification is general.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If bias amplification occurs early, then filtering or re-weighting pre-training data could reduce downstream bias more effectively than post-hoc alignment techniques.
The persistence of stereotypes after instruction-tuning raises the question of whether alignment objectives need explicit anti-stereotype terms.
Extending the co-occurrence method to measure bias in intermediate model checkpoints could reveal at which training stage amplification begins.

Load-bearing premise

That zero-shot prompting and token co-occurrence analyses provide a faithful, unbiased measure of how pretraining bias manifests in model behavior without introducing their own artifacts or selection effects.

What would settle it

A direct count showing that the frequency or strength of gender-occupation stereotypes in OLMo outputs is equal to or lower than their frequency in the Dolma training data would falsify the amplification claim.
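One way such a check could be operationalized is sketched below, assuming each occupation has a female share measured in the data and another measured in model outputs. The `amplification` function, the parity point of 0.5, and the toy values are assumptions for illustration; the paper's own definition of amplification is not given on this page.

```python
def amplification(p_data, p_model, parity=0.5):
    """Signed growth in skew from data to model.

    Positive when the model's skew points the same way as the data's skew
    and is larger in magnitude; negative when it shrinks or flips direction.
    """
    data_skew = p_data - parity
    model_skew = p_model - parity
    if data_skew == 0:
        # Data is balanced, so any model skew counts as newly introduced bias.
        return abs(model_skew)
    sign = 1 if data_skew > 0 else -1
    return sign * model_skew - abs(data_skew)


# Hypothetical (data share, model share) pairs, invented for illustration only.
toy = {
    "nurse": (0.80, 0.95),
    "engineer": (0.20, 0.05),
    "teacher": (0.60, 0.55),
}
scores = {occ: amplification(d, m) for occ, (d, m) in toy.items()}
amplified = [occ for occ, s in scores.items() if s > 0]
```

Under this definition, the amplification claim would be undermined if scores were zero or negative for most occupations on real measurements.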

Original abstract

As LLMs are increasingly integrated into user-facing applications, addressing biases that perpetuate societal inequalities is crucial. While much work has gone into measuring or mitigating biases in these models, fewer studies have investigated their origins. Therefore, this study examines the correlation between gender-occupation bias in pre-training data and their manifestation in LLMs, focusing on the Dolma dataset and the OLMo model. Using zero-shot prompting and token co-occurrence analyses, we explore how biases in training data influence model outputs. Our findings reveal that biases present in pre-training data are amplified in model outputs. The study also examines the effects of prompt types, hyperparameters, and instruction-tuning on bias expression, finding instruction-tuning partially alleviating representational bias while still maintaining overall stereotypical gender associations, whereas hyperparameters and prompting variation have a lesser effect on bias expression. Our research traces bias throughout the LLM development pipeline and underscores the importance of mitigating bias at the pretraining stage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper traces gender bias from Dolma co-occurrences into OLMo outputs and reports amplification, with instruction tuning offering partial relief.

The core finding is that gender-occupation bias measured by token co-occurrence in Dolma appears stronger in zero-shot outputs from OLMo, and that instruction tuning reduces some of the representational skew while leaving stereotypical associations intact. Hyperparameters and prompt variations matter less in the authors' tests. This is a direct empirical link between one pretraining corpus and one model rather than a new general method, which is useful because most bias papers stop at the final model. The work is honest about its scope and uses an open dataset and model, so the measurements can be checked. The main limitation is that zero-shot prompting and simple co-occurrence counts can pick up artifacts from prompt wording, occupation list selection, or how the model was aligned, and the paper does not appear to cross-check against internal activations or alternative metrics such as embedding similarities. With only this one dataset-model pair, the amplification result is suggestive but does not yet support broader claims about pretraining as the dominant source of bias. The statistical reporting and controls for multiple comparisons would need a close look in review. This is the sort of pipeline-tracing study that belongs in the literature on LLM bias origins. A referee should see it to verify the numbers and ask for robustness checks, but it is not desk-reject material.

Referee Report

2 major / 2 minor

Summary. The paper examines the correlation between gender-occupation bias in the Dolma pre-training dataset and its manifestation in the OLMo model. Using token co-occurrence analysis on the data and zero-shot prompting on the model, it claims that pretraining biases are amplified in model outputs. It further studies the effects of prompt types, hyperparameters, and instruction-tuning, finding that instruction-tuning partially alleviates representational bias while preserving overall stereotypical gender associations.

Significance. If the empirical mappings hold, the work provides concrete evidence that bias transmission occurs from pretraining data to aligned model outputs in an open model-dataset pair, underscoring the need for mitigation at the data curation stage rather than solely post-alignment. The observational tracing through the pipeline is a strength for an empirical study in this area.

major comments (2)

[Methods / Results (bias measurement)] The central amplification claim (abstract) rests on the assumption that token co-occurrence statistics in Dolma directly predict and are exceeded by bias measured via zero-shot prompts in OLMo. Without reported validation (e.g., comparison of co-occurrence to the model's actual next-token distributions on held-out data or ablation on prompt phrasing), differences may reflect metric artifacts or selection effects in occupation lists rather than faithful data-to-model transmission.
[Results (instruction-tuning analysis)] The claim that instruction-tuning 'partially alleviating representational bias while still maintaining overall stereotypical gender associations' (abstract) requires effect-size reporting and controls: what is the magnitude of change in bias metrics pre- vs. post-tuning, and are results robust to multiple prompt templates or random seeds? The current description leaves open whether the partial alleviation is statistically reliable or driven by specific choices.

minor comments (2)

[Methods] Clarify the exact definition of 'amplification' (e.g., ratio of bias scores, statistical test) and whether base-rate normalization was applied in the co-occurrence analysis.
[Abstract / Introduction] The abstract states the study 'traces bias throughout the LLM development pipeline,' but the reported analyses focus on pretraining data and post-alignment outputs; specify whether intermediate stages (e.g., SFT data) were examined.
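To make the base-rate question in the first minor comment concrete, the sketch below shows one common form of normalization: a log-odds ratio that compares an occupation's female-to-male co-occurrence odds with the corpus-wide odds. The function name, the add-one smoothing, and the example counts are assumptions; whether the paper applies any such normalization is exactly what the comment asks.

```python
import math


def normalized_log_odds(occ_female, occ_male, total_female, total_male, smoothing=1.0):
    """Log of (occupation female:male odds) / (corpus-wide female:male odds).

    Zero means the occupation is no more skewed than the corpus as a whole;
    positive means female-skewed, negative means male-skewed.
    `smoothing` is an assumed additive constant that avoids division by zero.
    """
    occ_odds = (occ_female + smoothing) / (occ_male + smoothing)
    base_odds = (total_female + smoothing) / (total_male + smoothing)
    return math.log(occ_odds / base_odds)


# Hypothetical counts: the corpus mentions men more often overall (300 vs 700).
raw_share = 30 / (30 + 70)  # looks male-skewed before normalization
matches_base_rate = normalized_log_odds(30, 70, 300, 700, smoothing=0.0)
female_skewed = normalized_log_odds(60, 40, 300, 700, smoothing=0.0)
```

In this toy case, an occupation with a raw female share of 0.3 is not skewed at all once the corpus-wide imbalance is taken into account, which is why the choice matters for any amplification ratio built on top of it.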

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on methodological validation and statistical robustness. We address each major comment below and will incorporate revisions to strengthen the empirical claims.

Referee: The central amplification claim (abstract) rests on the assumption that token co-occurrence statistics in Dolma directly predict and are exceeded by bias measured via zero-shot prompts in OLMo. Without reported validation (e.g., comparison of co-occurrence to the model's actual next-token distributions on held-out data or ablation on prompt phrasing), differences may reflect metric artifacts or selection effects in occupation lists rather than faithful data-to-model transmission.

Authors: We agree that direct validation would better support the transmission interpretation. Our current approach follows common practice in the bias literature by treating co-occurrence as a data-level proxy, but we will add (1) a comparison of occupation-gender co-occurrence rates against OLMo’s next-token log-probabilities on a held-out Dolma subset and (2) an ablation across three prompt phrasings to quantify sensitivity. These additions will be reported in a new subsection of the Methods and Results.
revision: yes

Referee: The claim that instruction-tuning 'partially alleviating representational bias while still maintaining overall stereotypical gender associations' (abstract) requires effect-size reporting and controls: what is the magnitude of change in bias metrics pre- vs. post-tuning, and are results robust to multiple prompt templates or random seeds? The current description leaves open whether the partial alleviation is statistically reliable or driven by specific choices.

Authors: We will report Cohen’s d (or equivalent) for the pre- vs. post-instruction-tuning change in each bias metric. We will also rerun the instruction-tuning analysis across five distinct prompt templates and three random seeds, presenting means and standard deviations to demonstrate robustness. These results will be added to the instruction-tuning subsection.
revision: yes
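The effect size named in this response could be computed along the following lines. This is a minimal sketch of Cohen's d with a pooled standard deviation, treating each (prompt template, seed) run as one observation; the function name and the bias scores are invented for illustration and are not results from the paper.

```python
import math
import statistics


def cohens_d(before, after):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Negative values mean the `after` scores are lower on average than `before`.
    """
    n1, n2 = len(before), len(after)
    if n1 < 2 or n2 < 2:
        raise ValueError("each sample needs at least two observations")
    pooled_var = (
        (n1 - 1) * statistics.variance(before)
        + (n2 - 1) * statistics.variance(after)
    ) / (n1 + n2 - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; d is undefined")
    return (statistics.mean(after) - statistics.mean(before)) / math.sqrt(pooled_var)


# Hypothetical bias scores per run, base model vs. instruction-tuned model.
base_runs = [0.42, 0.45, 0.40]
tuned_runs = [0.30, 0.33, 0.31]
d = cohens_d(base_runs, tuned_runs)
```

If the same templates and seeds are used before and after tuning, a paired effect size (mean of per-run differences divided by their standard deviation) may be the more appropriate choice; the pooled version above assumes independent samples.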

Circularity Check

0 steps flagged

No circularity: purely observational empirical measurements

The paper performs direct empirical measurements of gender-occupation bias via token co-occurrence statistics in the Dolma pretraining corpus and via zero-shot prompting outputs from the OLMo model, then compares the two. No mathematical derivations, fitted parameters renamed as predictions, self-citation load-bearing theorems, or ansatzes are present. All reported quantities are independent observations rather than quantities forced by construction from the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are stated or derivable.

