Identifying Bias in Machine-generated Text Detection
Pith reviewed 2026-05-17 00:09 UTC · model grok-4.3
The pith
Machine-generated text detectors exhibit biases: essays by non-White English-language learners are flagged as machine-generated more often than others.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
While detection systems generally perform well on the task, they exhibit biases that are inconsistent across systems when applied to student essays: several models tend to classify writing from disadvantaged groups as machine-generated, ELL essays are more likely to receive that label, economically disadvantaged students' essays are less likely to receive it, and non-White ELL essays are disproportionately classified as machine-generated compared with White ELL essays. Regression models confirm the statistical significance of these effects, and subgroup analyses surface the interactions. Human annotators show poor overall accuracy on the detection task but no comparable demographic biases.
What carries the argument
Regression-based models and subgroup analysis applied to a curated dataset of student essays evaluated by 16 detection systems across the four attributes of gender, race/ethnicity, ELL status, and economic status.
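The review does not reproduce the paper's exact regression setup; the following is a minimal sketch of one plausible specification for that kind of analysis. The dataframe, column names, and statsmodels formula are assumptions made for illustration, not the authors' code.

```python
# Hedged sketch: per-detector logistic regression of the binary
# "machine-generated" label on demographic attributes, plus a simple
# subgroup comparison. Input file and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("essays_with_detector_labels.csv")  # hypothetical: one row per (essay, detector)

# Fit one model per detector; significant coefficients would indicate
# demographic attributes shifting the odds of a machine-generated label.
for detector, group in df.groupby("detector"):
    model = smf.logit(
        "flagged_machine ~ C(gender) + C(race) + C(ell_status) + C(econ_disadvantaged)",
        data=group,
    ).fit(disp=False)
    print(detector)
    print(model.summary2().tables[1][["Coef.", "P>|z|"]])

# Subgroup analysis for the interaction the paper highlights:
# flag rates among ELL essays, split by White vs. non-White students.
ell = df[df["ell_status"] == 1]
print(ell.groupby(ell["race"].eq("White"))["flagged_machine"].mean())
```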
If this is right
- Educational tools that rely on these detectors may produce unfair outcomes for English-language learners and non-White students.
- Economic status shows an opposite bias in some systems, suggesting multiple independent mechanisms at work.
- Human judgment does not exhibit the same demographic skews, pointing to the detectors themselves as the source of the disparity.
- Bias patterns are not uniform, so mitigation approaches would need to be tested system by system rather than applied universally.
Where Pith is reading between the lines
- Deployment of these detectors in schools or platforms could amplify existing disparities in how student work is evaluated for authenticity.
- Testing protocols for future detectors should include stratified samples across ELL status and race to catch interactions before release.
- The gap between human and machine performance on this task suggests room for hybrid systems that combine both.
Load-bearing premise
The curated collection of student essays is representative of the populations studied, and any observed differences in classification rates stem from detector biases rather than from unmeasured differences in the content or style of the essays themselves.
What would settle it
A follow-up study that collects a new, larger set of student essays from the same demographic groups, runs the identical 16 detectors, and finds no statistically significant differences in machine-generated labels across gender, race, ELL status, or economic status.
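A hedged sketch of the check such a replication would run: per-attribute independence tests (or the same regressions) on the new sample, all of which would need to come back non-significant. File and column names below are hypothetical.

```python
# Sketch of the settling experiment: on a new essay sample labeled by the
# same 16 detectors, test whether flag rates differ across each attribute.
import pandas as pd
from scipy.stats import chi2_contingency

new_df = pd.read_csv("replication_essays_with_labels.csv")  # hypothetical

for attr in ["gender", "race", "ell_status", "econ_disadvantaged"]:
    table = pd.crosstab(new_df[attr], new_df["flagged_machine"])
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{attr}: chi2={chi2:.2f}, dof={dof}, p={p:.4f}")
# The proposed resolution: none of these tests (nor the corresponding
# regression coefficients) reach significance on the larger sample.
```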
Original abstract
The meteoric rise in text generation capability has been accompanied by parallel growth in interest in machine-generated text detection: the capability to identify whether a given text was generated using a model or written by a person. While detection models show strong performance, they have the capacity to cause significant negative impacts. We explore potential biases in English machine-generated text detection systems. We curate a dataset of student essays and assess 16 different detection systems for bias across four attributes: gender, race/ethnicity, English-language learner (ELL) status, and economic status. We evaluate these attributes using regression-based models to determine the significance and power of the effects, as well as performing subgroup analysis. We find that while biases are generally inconsistent across systems, there are several key issues: several models tend to classify disadvantaged groups as machine-generated, ELL essays are more likely to be classified as machine-generated, economically disadvantaged students' essays are less likely to be classified as machine-generated, and non-White ELL essays are disproportionately classified as machine-generated relative to their White counterparts. Finally, we perform human annotation and find that while humans perform generally poorly at the detection task, they show no significant biases on the studied attributes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper curates a dataset of student essays and evaluates 16 machine-generated text detectors for bias across gender, race/ethnicity, ELL status, and economic status. Using regression-based significance testing and subgroup analysis, it reports generally inconsistent biases but identifies several patterns: some models classify disadvantaged groups as machine-generated, ELL essays are more likely to be flagged as machine-generated, economically disadvantaged essays are less likely to be flagged, and non-White ELL essays show disproportionate flagging relative to White counterparts. A human annotation study finds poor overall detection performance but no significant demographic biases.
Significance. If the reported patterns are robust to linguistic confounds, the work provides actionable evidence of fairness risks in deploying detection systems for educational assessment. Strengths include the multi-detector evaluation, regression-based testing with subgroup analysis, and direct comparison to human annotators on the same held-out essays.
major comments (2)
- [Section 4] Section 4 regressions condition only on the four target attributes (gender, race, ELL, economic status) without controls for essay length, syntactic complexity, lexical diversity, or prompt adherence. Since ELL and economically disadvantaged essays are known to differ systematically on these surface features even for identical prompts, the observed coefficients may reflect detector sensitivity to style rather than demographic bias; this directly affects the central interpretation of the results as evidence of bias.
- [Section 5] The human-annotation arm (Section 5) does not measure whether the same uncontrolled linguistic features drive both human and model decisions, leaving open the possibility that the lack of human bias is due to different decision criteria rather than absence of bias.
minor comments (2)
- [Section 5] Provide explicit sample sizes, essay length statistics, and inter-annotator agreement for the human study to allow readers to assess power and reliability.
- [Section 3] Clarify the exact prompts used for the student essays and whether all essays were written to the same set of prompts, as prompt variation could interact with the reported subgroup effects.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments. We address each major comment below and outline the revisions we will make to strengthen the manuscript's analysis of potential confounds.
Point-by-point responses
- Referee: [Section 4] Section 4 regressions condition only on the four target attributes (gender, race, ELL, economic status) without controls for essay length, syntactic complexity, lexical diversity, or prompt adherence. Since ELL and economically disadvantaged essays are known to differ systematically on these surface features even for identical prompts, the observed coefficients may reflect detector sensitivity to style rather than demographic bias; this directly affects the central interpretation of the results as evidence of bias.
Authors: We agree that the regressions in Section 4 do not control for linguistic features such as essay length, syntactic complexity, lexical diversity, or prompt adherence, which could confound the interpretation of demographic effects. In the revised manuscript, we will add new regression specifications that include these covariates as controls. We will report both the original and controlled models to assess whether the observed patterns persist, thereby clarifying the extent to which the results reflect demographic bias versus sensitivity to writing style. revision: yes
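A minimal sketch of what such a controlled specification could look like, assuming simple proxies for the covariates (token count for length, type-token ratio for lexical diversity). These feature choices are illustrative stand-ins, not the measures the authors commit to.

```python
# Hedged sketch of the promised controlled regression: the same demographic
# predictors plus linguistic covariates. Input file, column names, and
# feature definitions are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("essays_with_detector_labels.csv")  # hypothetical input
df["length_tokens"] = df["essay_text"].str.split().str.len()
df["type_token_ratio"] = df["essay_text"].apply(
    lambda t: len(set(t.lower().split())) / max(len(t.split()), 1)
)

controlled = smf.logit(
    "flagged_machine ~ C(gender) + C(race) + C(ell_status) + C(econ_disadvantaged)"
    " + length_tokens + type_token_ratio",
    data=df,
).fit(disp=False)
# Comparing demographic coefficients with and without these controls shows
# how much of the original effect plausibly reflects writing style.
print(controlled.summary2().tables[1][["Coef.", "P>|z|"]])
```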
- Referee: [Section 5] The human-annotation arm (Section 5) does not measure whether the same uncontrolled linguistic features drive both human and model decisions, leaving open the possibility that the lack of human bias is due to different decision criteria rather than absence of bias.
Authors: We acknowledge that Section 5 does not directly examine whether linguistic features influence human and model decisions in comparable ways. To address this, we will extend the analysis in the revised Section 5 to include correlations between key linguistic features (length, complexity, diversity) and both human annotations and model predictions. This addition will help determine if humans and detectors rely on similar cues and will strengthen the comparison between human and automated bias findings. revision: yes
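A hedged sketch of the proposed cue comparison: correlate candidate linguistic features with human judgments and with detector scores, then compare the two profiles. Column names are hypothetical placeholders, not the authors' variables.

```python
# Sketch of the Section 5 extension: do humans and detectors track the
# same surface cues? Point-biserial correlations via Pearson on a binary
# human judgment and a continuous detector score.
import pandas as pd

df = pd.read_csv("essays_with_human_and_model_judgments.csv")  # hypothetical
features = ["length_tokens", "syntactic_complexity", "type_token_ratio"]

for feat in features:
    r_human = df[feat].corr(df["human_says_machine"])
    r_model = df[feat].corr(df["detector_score"])
    print(f"{feat}: human r={r_human:.2f}, detector r={r_model:.2f}")
# Similar signs and magnitudes would suggest shared cues; divergence would
# support the referee's worry that humans and detectors use different criteria.
```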
Circularity Check
No circularity: purely empirical analysis with direct observations
Full rationale
The paper curates a student-essay dataset and evaluates 16 detectors via regression and subgroup analysis on observed machine-generated labels. No derivations, equations, or predictions exist that reduce to fitted inputs or self-definitions by construction. All claims rest on held-out classifications and human annotations rather than any self-referential structure. Self-citations (if present) are not load-bearing for the central empirical findings.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Linear regression assumptions hold for modeling the relationship between demographic attributes and detector outputs.
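The review does not state the regression form; read literally, the assumption amounts to something like the following linear specification for a detector's output on essay i, with demographic indicators as predictors. This is an illustrative reading, not the paper's exact model or attribute coding.

```latex
% Illustrative specification consistent with the stated assumption.
y_i = \beta_0 + \beta_1\,\mathrm{ELL}_i + \beta_2\,\mathrm{Econ}_i
      + \beta_3\,\mathrm{Race}_i + \beta_4\,\mathrm{Gender}_i + \varepsilon_i,
\qquad \varepsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0,\sigma^2)
```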