Quantifying Algorithmic Biases over Time

Ishaan Singh; Vivek K. Singh

arxiv: 1907.01671 · v1 · pith:YHY5UAYMnew · submitted 2019-07-02 · 💻 cs.CY · cs.LG

Pith reviewed 2026-05-25 10:16 UTC · model grok-4.3

keywords: algorithmic bias, temporal variation, gender bias, image search, Twitter, longitudinal study, nurse, doctor

The pith

Image search biases for gender terms can reverse direction across consecutive days.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes metrics to measure how biases in algorithms change over time instead of assuming they stay fixed. It tests these metrics on Twitter image search results for the hashtags #Nurse and #Doctor, tracking the gender of people shown in the images each day for 21 days. The measurements reveal that the level of bias shifts substantially and that the apparent direction of the bias can point the opposite way on different days. This pattern indicates that studies based on a single snapshot cannot capture how algorithmic bias actually behaves.

Core claim

Biases in algorithmic outputs vary significantly over time, and the direction of bias can appear different on different days, so that one-shot measurements may not suffice for understanding algorithmic bias.

What carries the argument

A set of intuitive metrics for quantifying day-to-day variations in gender representation within image search results.
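The available text does not give formal definitions for these metrics, so the following is only a hypothetical sketch of what such measures could look like. It assumes each day yields a count of images labeled female and male; the names `bias_score` and `temporal_summary`, the signed-score formula, and the example counts are all illustrative assumptions, not the authors' method or data.

```python
from statistics import mean, pstdev


def bias_score(female: int, male: int) -> float:
    """Signed bias in [-1, 1]: positive means more women shown, negative more men."""
    total = female + male
    if total == 0:
        raise ValueError("no labeled images for this day")
    return (female - male) / total


def temporal_summary(daily_counts):
    """Summarize how the signed bias moves across days.

    daily_counts: list of (female, male) tuples, one per day.
    """
    scores = [bias_score(f, m) for f, m in daily_counts]
    # Count reversals: consecutive non-zero scores with opposite signs.
    nonzero = [s for s in scores if s != 0]
    flips = sum(1 for a, b in zip(nonzero, nonzero[1:]) if a * b < 0)
    return {
        "scores": scores,
        "mean": mean(scores),
        "std": pstdev(scores),
        "range": max(scores) - min(scores),
        "direction_flips": flips,
    }


# Made-up counts for illustration only (not data from the paper).
example = [(6, 4), (3, 7), (5, 5), (8, 2)]
summary = temporal_summary(example)
```

In this sketch a non-zero `direction_flips` value corresponds to the reversal of apparent bias direction described above, while `std` and `range` describe how far the level of bias moves.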

If this is right

Multiple measurements spaced over time are required to obtain a reliable picture of algorithmic bias.
The same query can produce outputs that favor different genders on different days.
Efforts to detect or reduce bias must account for temporal instability rather than rely on single observations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar daily tracking could be applied to other search platforms and query types to test whether temporal bias patterns are widespread.
Audit processes for algorithms may need to shift from periodic checks to continuous monitoring.
Future experiments could hold the underlying image collection fixed to isolate whether bias changes originate inside the ranking algorithm.

Load-bearing premise

The observed day-to-day changes in image search results are driven by algorithmic evolution rather than external factors such as shifts in user-uploaded content, platform indexing changes, or sampling variation in the search API.

What would settle it

If repeated searches on the same day produce identical gender distributions while searches across days show consistent shifts that align with known algorithm updates, the claim that bias varies due to algorithmic change would be supported. The opposite pattern, with same-day searches disagreeing as much as searches on different days, would undermine it by pointing to sampling variation instead.
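One way to make this check concrete is to compare the spread among repeated searches on the same day with the spread of daily averages. The sketch below is an assumption about how such a comparison might be set up; `variance_split` and the example shares are hypothetical, and the paper as summarized here reports no repeated same-day searches.

```python
from statistics import mean, pvariance


def variance_split(shares_by_day):
    """Split variation in female share into within-day and across-day parts.

    shares_by_day: list of lists; each inner list holds the female share
    measured in repeated searches on one day.
    """
    day_means = [mean(day) for day in shares_by_day]
    # Average spread among repeated searches made on the same day.
    within = mean([pvariance(day) for day in shares_by_day])
    # Spread of the daily averages around their overall mean.
    across = pvariance(day_means)
    return within, across


# Made-up shares: identical results within each day, different across days.
stable_within_day = [[0.6, 0.6, 0.6], [0.3, 0.3, 0.3], [0.5, 0.5, 0.5]]
within, across = variance_split(stable_within_day)
```

A small within-day value next to a large across-day value would match the supporting pattern; it would still not separate ranking changes from shifts in uploaded content, which needs the fixed-collection control mentioned earlier.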

Original abstract

Algorithms now permeate multiple aspects of human lives and multiple recent results have reported that these algorithms may have biases pertaining to gender, race, and other demographic characteristics. The metrics used to quantify such biases have still focused on a static notion of algorithms. However, algorithms evolve over time. For instance, Tay (a conversational bot launched by Microsoft) was arguably not biased at its launch but quickly became biased, sexist, and racist over time. We suggest a set of intuitive metrics to study the variations in biases over time and present the results for a case study for genders represented in images resulting from a Twitter image search for #Nurse and #Doctor over a period of 21 days. Results indicate that biases vary significantly over time and the direction of bias could appear to be different on different days. Hence, one-shot measurements may not suffice for understanding algorithmic bias, thus motivating further work on studying biases in algorithms over time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The temporal angle on bias is worth raising but the Twitter case study leaves the cause of the shifts unclear.

The main thing to know is that this paper flags a practical gap in how we measure bias but its evidence does not pin the day-to-day changes on the algorithm itself. The 21-day image search for #Nurse and #Doctor shows the gender split varying and sometimes reversing, which supports the claim that a single snapshot can mislead. That observation is straightforward and useful as a reminder for anyone running fairness checks on live systems. The authors also correctly note that examples like Tay show bias can emerge over time rather than stay fixed. The work is new in applying this lens to image search results and in proposing simple metrics to track it. The execution stays grounded in a real platform rather than abstract theory.

The soft spot is the attribution. Nothing isolates algorithmic evolution from new uploads under those hashtags, indexing updates on Twitter, or normal sampling noise in the search API. Without controls or discussion of those factors the variation cannot be confidently called algorithmic. The metrics themselves are called intuitive but receive no formal definition, no error bars, and no statistical tests in the available text. This leaves the central result more illustrative than conclusive.

The paper is aimed at fairness researchers and practitioners who audit search or recommendation systems. A reader already working on temporal monitoring would find the example helpful even if preliminary. It deserves peer review because the underlying question matters for deployed systems and the authors have done the initial empirical legwork; referees can push for the missing controls and clearer metric definitions.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a set of intuitive metrics to quantify variations in algorithmic biases over time, motivated by the observation that algorithms evolve (e.g., the Tay chatbot example). It presents results from a 21-day case study measuring gender representation in Twitter image search results for the hashtags #Nurse and #Doctor, concluding that biases vary significantly day-to-day and can reverse direction, implying that one-shot measurements are insufficient.

Significance. If the temporal variations can be confidently attributed to algorithmic evolution rather than external factors, the work would usefully highlight limitations of static bias audits and motivate longitudinal measurement frameworks in algorithmic fairness research.

major comments (2)

[Case study] Case study (21-day Twitter image search experiment): the central claim that observed day-to-day changes demonstrate algorithmic bias variation rests on attributing those changes to the algorithm, yet the manuscript provides no controls, measurements, or discussion isolating algorithmic updates from shifts in user-uploaded content, platform indexing changes, or sampling variation in the search API.
[Metrics] Metrics section: no formal definitions, formulas, or statistical tests are supplied for the proposed bias-variation metrics; the abstract and case study report no error bars, confidence intervals, or hypothesis tests on the daily gender-representation measurements.
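To illustrate the kind of uncertainty estimate the second comment asks for, a daily proportion can be reported with a Wilson score interval. This is a standard technique chosen here for illustration, not something the manuscript uses; the function name `wilson_interval` and the example counts are assumptions.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical day: 6 of 10 labeled images show women.
low, high = wilson_interval(6, 10)
```

With only ten labeled images the interval in this example spans 0.5, so the apparent direction of bias on such a day could not be distinguished from parity. This is why the number of results collected per day matters for the reversal claim.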

minor comments (2)

The manuscript would benefit from explicit pseudocode or equations defining the suggested metrics to allow replication.
Clarify the exact procedure for collecting and labeling the image search results (e.g., how many results per day, labeling criteria for gender).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments on our manuscript. We address each major comment below.

Referee: [Case study] Case study (21-day Twitter image search experiment): the central claim that observed day-to-day changes demonstrate algorithmic bias variation rests on attributing those changes to the algorithm, yet the manuscript provides no controls, measurements, or discussion isolating algorithmic updates from shifts in user-uploaded content, platform indexing changes, or sampling variation in the search API.

Authors: We agree that the case study does not include controls or measurements to isolate algorithmic changes from other factors such as user content shifts or API sampling. The case study was intended to show that large day-to-day variations occur and thereby motivate temporal metrics. We will add a limitations discussion section that explicitly acknowledges these confounding possibilities and notes that the variations cannot be definitively attributed to the algorithm alone.

revision: partial

Referee: [Metrics] Metrics section: no formal definitions, formulas, or statistical tests are supplied for the proposed bias-variation metrics; the abstract and case study report no error bars, confidence intervals, or hypothesis tests on the daily gender-representation measurements.

Authors: We agree that formal definitions, formulas, and statistical analysis are absent. In the revision we will supply explicit mathematical definitions and formulas for the metrics. We will also add error bars or confidence intervals to the daily measurements and include appropriate statistical tests for the significance of observed temporal changes.

revision: yes
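The rebuttal does not say which statistical tests would be added. As one hedged possibility, a two-proportion z-test could compare the female share on two different days. The sketch below uses the normal approximation and treats images as independent draws, both of which are assumptions; `two_proportion_z_test` and the counts are hypothetical.

```python
from math import erf, sqrt


def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int):
    """Two-sided z-test for a difference between two proportions.

    x1/n1 and x2/n2 are the counts of female-labeled images over all
    labeled images on day 1 and day 2. Returns (z, p_value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Both days are all-female or all-male: no detectable difference.
        return 0.0, 1.0
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value


# Made-up counts: 30 of 50 on one day versus 15 of 50 on another.
z, p_value = two_proportion_z_test(30, 50, 15, 50)
```

Comparing many pairs of days over a 21-day window would also call for a multiple-comparison correction, and small daily samples may be better served by an exact test.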

Circularity Check

0 steps flagged

No circularity: empirical case study with no derivations or self-referential fits

The paper reports direct observations from a 21-day Twitter image search case study for #Nurse and #Doctor without any equations, parameter fitting, or derivation chain. The central claim—that bias metrics vary day-to-day—rests on raw counts of gender representation in returned images rather than any self-definition, fitted-input prediction, or self-citation load-bearing step. No load-bearing premise reduces to the paper's own inputs by construction, satisfying the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5680 in / 1013 out tokens · 21325 ms · 2026-05-25T10:16:16.974899+00:00 · methodology
