Evaluating LLMs for Demographic-Targeted Social Bias Detection: A Comprehensive Benchmark Study

Ayan Majumdar; Feihao Chen; Jinghui Li; Xiaozhen Wang

arxiv: 2510.04641 · v3 · submitted 2025-10-06 · 💻 cs.CL · cs.CY· cs.LG

Evaluating LLMs for Demographic-Targeted Social Bias Detection: A Comprehensive Benchmark Study

Ayan Majumdar , Feihao Chen , Jinghui Li , Xiaozhen Wang This is my paper

Pith reviewed 2026-05-18 10:35 UTC · model grok-4.3

classification 💻 cs.CL cs.CYcs.LG

keywords LLM evaluationsocial bias detectiondemographic biasesmulti-label classificationfine-tuningdata auditingbenchmark studyhate speech

0 comments

The pith

Fine-tuned smaller LLMs detect demographic-targeted biases at scale yet leave gaps for multi-group cases and some axes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks LLMs on spotting social biases aimed at specific demographic groups in English web text. It treats detection as a multi-label problem that identifies one or more targeted identities at once according to a fixed demographic taxonomy. Models are tested with prompting, in-context learning, and fine-tuning across twelve datasets that vary in content type and demographic coverage. Fine-tuned compact models perform well enough to support large-scale auditing of training corpora. The same results also show that accuracy falls when biases hit several demographics together or when certain demographic categories are involved.

Core claim

Systematic evaluation of LLMs across scales and techniques on twelve datasets demonstrates that fine-tuning smaller models delivers practical performance for scalable detection of demographic-targeted biases framed as multi-label identification of targeted identities. At the same time the results reveal consistent shortfalls when the same biases affect multiple demographics simultaneously and when particular demographic axes are examined in isolation.

What carries the argument

Multi-label classification of targeted demographic identities via a demographic-focused taxonomy, evaluated through prompting, in-context learning, and fine-tuning.

If this is right

Fine-tuned smaller models can serve as practical tools for auditing large web-scraped corpora before they are used to train general-purpose AI systems.
Detection performance must improve specifically for biases that simultaneously target more than one demographic group.
Accuracy differences across demographic axes indicate that current methods are not yet uniform in coverage.
Regulatory requirements for bias auditing in AI training data can be met more readily by multi-label approaches than by single-category methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These detection methods could be inserted into existing data-cleaning pipelines to lower the chance that harmful biases reach downstream model training.
Applying the same benchmark protocol to non-English text would test whether the observed gaps are language-specific or more general.
Ensemble or hybrid systems that combine fine-tuned models with rule-based checks might close some of the multi-demographic shortfalls without increasing model size.

Load-bearing premise

The twelve datasets and demographic-focused taxonomy capture a sufficiently representative sample of real-world demographic-targeted biases in large web-scraped text.

What would settle it

A fresh collection of web-scraped text with verified multi-demographic bias labels on which fine-tuned models show no meaningful improvement over simple baselines would falsify the claim of scalable detection.

read the original abstract

Large-scale web-scraped text corpora used to train general-purpose AI models often contain harmful demographic-targeted social biases, creating a regulatory need for data auditing and developing scalable bias-detection methods. Although prior work has investigated biases in text datasets and related detection methods, these studies remain narrow in scope. They typically focus on a single content type (e.g., hate speech), cover limited demographic axes, overlook biases affecting multiple demographics simultaneously, and analyze limited techniques. Consequently, practitioners lack a holistic understanding of the strengths and limitations of recent large language models (LLMs) for automated bias detection. In this study, we conduct a comprehensive benchmark study on English texts to assess the ability of LLMs in detecting demographic-targeted social biases. To align with regulatory requirements, we frame bias detection as a multi-label task of detecting targeted identities using a demographic-focused taxonomy. We then systematically evaluate models across scales and techniques, including prompting, in-context learning, and fine-tuning. Using twelve datasets spanning diverse content types and demographics, our study demonstrates the promise of fine-tuned smaller models for scalable detection. However, our analyses also expose persistent gaps across demographic axes and multi-demographic targeted biases, underscoring the need for more effective and scalable detection frameworks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Fine-tuned smaller models look promising for multi-label demographic bias detection on this benchmark, but gaps remain on overlapping cases.

read the letter

The main point here is that fine-tuned smaller models show promise for scalable detection of demographic-targeted social biases in this multi-label setup, while the results also point to ongoing problems with biases that hit multiple demographics together. What the paper actually adds is a broader benchmark: a demographic taxonomy turned into multi-label detection, tested on twelve datasets covering different content, plus a clean comparison of prompting, in-context learning, and fine-tuning across model sizes. That is more comprehensive than the single-focus studies it builds on. They handle the execution reasonably well. Dataset details and metric definitions are laid out, results include breakdowns by axis, and the design accounts for model scale without obvious circularity or fitting issues. The main limitation is generalizability. The datasets are diverse but the paper correctly flags that they may not capture every real-world bias pattern in web data. The English-only scope is another boundary that keeps the findings from applying everywhere. This work is for teams doing data audits or developing bias mitigation tools. Anyone needing numbers on which LLM techniques work better for this task will find it useful. I would send it to peer review. The empirical foundation is solid enough to justify referee time, even with the expected discussion on scope.

Referee Report

1 major / 3 minor

Summary. The manuscript presents a benchmark study assessing LLMs' performance in detecting demographic-targeted social biases framed as a multi-label task. Using a custom taxonomy and twelve datasets, it compares prompting, in-context learning, and fine-tuning approaches across model scales, concluding that fine-tuned smaller models offer a scalable solution while noting ongoing challenges with certain demographic groups and multi-target biases.

Significance. This study contributes to the field by providing a more comprehensive evaluation than previous narrow-scope works, offering practical insights for developing bias detection tools that align with regulatory demands for auditing training data. The emphasis on multi-label detection and analysis of gaps across axes is particularly valuable for advancing more nuanced bias mitigation strategies. The systematic nature of the experiments, including controls for model scale, adds to its utility as a reference benchmark.

major comments (1)

§4, main results tables: the claim that fine-tuned smaller models show promise for scalable detection is supported by F1 comparisons, but the tables lack error bars, standard deviations across runs, or statistical significance tests for differences versus larger models or baselines; this weakens the robustness of the scalability conclusion.

minor comments (3)

Abstract: high-level conclusions are stated without any key quantitative results (e.g., specific F1 ranges), which reduces the abstract's informativeness given the empirical nature of the work.
§3.1: a summary table listing the twelve datasets with size, source, and demographic coverage would improve clarity and allow readers to quickly assess coverage.
Figure 3: axis labels and legend for prompting versus fine-tuning conditions overlap slightly, affecting readability of the performance trends.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback and recommendation of minor revision. We are pleased that the study is recognized for its comprehensive scope and practical insights. Below, we provide a point-by-point response to the major comment.

read point-by-point responses

Referee: [—] §4, main results tables: the claim that fine-tuned smaller models show promise for scalable detection is supported by F1 comparisons, but the tables lack error bars, standard deviations across runs, or statistical significance tests for differences versus larger models or baselines; this weakens the robustness of the scalability conclusion.

Authors: We agree with the referee that incorporating measures of variability and statistical significance would enhance the robustness of our conclusions regarding the scalability of fine-tuned smaller models. In the original experiments, fine-tuning was performed with a fixed seed for reproducibility, but we will rerun the fine-tuning experiments with multiple random seeds (e.g., 3-5 runs) to compute standard deviations and include error bars in the revised tables. Additionally, we will add statistical significance tests (such as McNemar's test or paired t-tests on F1 scores) to compare the fine-tuned models against larger models and baselines. These updates will be reflected in Section 4 and the corresponding tables. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is an empirical benchmark study with no derivations, equations, or self-referential definitions. Claims rest on evaluation across 12 external datasets using standard prompting, in-context learning, and fine-tuning regimes with precision/recall/F1 metrics. No load-bearing step reduces to fitted parameters or self-citation chains; the taxonomy and multi-label framing are constructed from prior literature and applied to independent data sources.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only view limits visibility into exact parameters or entities; the taxonomy and dataset selection function as key unverified assumptions for the multi-label task.

axioms (1)

domain assumption The demographic-focused taxonomy accurately captures targeted identities in biased text.
Invoked to frame bias detection as a multi-label task aligned with regulatory needs.

pith-pipeline@v0.9.0 · 5761 in / 1317 out tokens · 42191 ms · 2026-05-18T10:35:32.056656+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We frame bias detection as a multi-label task using a demographic-focused taxonomy... evaluate models across scales and techniques, including prompting, in-context learning, and fine-tuning.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Fine-tuned encoder models... achieve markedly lower disparities... persistent gaps across demographic axes and multi-demographic targeted biases

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.