BRIDGE the Gap: Mitigating Bias Amplification in Automated Scoring of English Language Learners via Inter-group Data Augmentation

arxiv: 2602.23580 · v2 · submitted 2026-02-27 · 💻 cs.CL · cs.AI

Yun Wang, Xuansheng Wu, Jingyuan Huang, Lei Liu, Xiaoming Zhai, Ninghao Liu

Pith reviewed 2026-05-15 19:26 UTC · model grok-4.3

keywords bias mitigation · data augmentation · automated scoring · English Language Learners · educational assessment · fairness in AI · deep learning

The pith

A data synthesis method reduces bias against high-scoring English Language Learners in automated test scoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes BRIDGE, a framework that generates synthetic high-scoring ELL responses by combining authentic ELL language patterns with construct-relevant content from non-ELL samples. This addresses the scarcity of high-scoring ELL samples, which leads models to underpredict ELL students who show comparable knowledge but use different linguistic styles. Experiments show that it reduces prediction bias while maintaining overall scoring performance, with fairness gains comparable to those from adding real human data. The approach is designed for low-resource settings in educational assessment.

Core claim

BRIDGE synthesizes high-scoring ELL samples by pasting rubric-aligned knowledge and evidence from abundant high-scoring non-ELL samples into authentic ELL linguistic patterns; a discriminator model then checks the quality of the synthetic samples. This mitigates representation bias in automated scoring systems trained on imbalanced data.

What carries the argument

BRIDGE, a Bias-Reducing Inter-group Data GEneration framework, pastes construct-relevant content from high-scoring non-ELL responses into ELL linguistic patterns and uses a discriminator to guard the quality of the result.
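The generate-then-filter loop described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Response` type, `bridge_augment`, the round-robin pairing rule, the score-transfer rule, the threshold, and both toy stand-ins (`toy_paste`, `toy_quality`) are hypothetical names and choices introduced only for this example. The paper's actual generator and discriminator are not specified on this page.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Response:
    text: str
    score: int
    is_ell: bool


def bridge_augment(
    non_ell_high: List[Response],
    ell_samples: List[Response],
    paste: Callable[[str, str], str],
    quality: Callable[[str], float],
    threshold: float = 0.5,
) -> List[Response]:
    """Generate synthetic high-scoring ELL responses, then filter them."""
    if not ell_samples:
        return []
    synthetic: List[Response] = []
    for i, source in enumerate(non_ell_high):
        # Assumption: round-robin pairing of content sources with style samples.
        style = ell_samples[i % len(ell_samples)]
        candidate = paste(source.text, style.text)
        # Keep only candidates that the quality check accepts.
        if quality(candidate) >= threshold:
            # Assumption: the synthetic sample inherits the source's score.
            synthetic.append(Response(candidate, source.score, True))
    return synthetic


def toy_paste(content: str, style: str) -> str:
    # Toy stand-in: prefix the content with the style sample's first sentence.
    opener = style.split(".")[0].strip()
    return f"{opener}. {content}"


def toy_quality(text: str) -> float:
    # Toy stand-in for the discriminator: accept candidates of 12+ words.
    return 1.0 if len(text.split()) >= 12 else 0.0


non_ell_high = [
    Response("Plants convert light energy into chemical energy stored in glucose.", 3, False),
    Response("Photosynthesis.", 3, False),
]
ell_samples = [
    Response("In my opinion the plant makes food using the sun. It grows.", 1, True),
]

synthetic = bridge_augment(non_ell_high, ell_samples, toy_paste, toy_quality)
for r in synthetic:
    print(r.score, r.text)
```

Passing `paste` and `quality` in as callables keeps the sketch independent of any particular model; in a real system both would presumably be learned components rather than string rules.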

If this is right

Prediction bias for high-scoring ELL students is reduced on CAST datasets.
Overall scoring performance is maintained.
Fairness gains are comparable to those from using additional real human data.
The method offers a cost-effective alternative for equitable scoring in large-scale assessments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar inter-group augmentation could apply to other imbalanced domains like medical diagnosis or hiring algorithms where minority groups use different expression styles.
Future work might test if the discriminator can be replaced with human evaluation or more advanced quality checks.
The separation of content and language assumes rubrics capture construct-relevant parts cleanly, which may need validation per subject.

Load-bearing premise

Construct-relevant content can be cleanly separated from linguistic patterns and pasted into ELL responses without introducing new artifacts or biases that the discriminator misses.

What would settle it

A human evaluation study in which raters judge whether synthetic ELL samples contain unnatural artifacts or altered meaning compared to real ones, or a replication showing no bias reduction when BRIDGE is applied.

Figures

Figures reproduced from arXiv: 2602.23580 by Jingyuan Huang, Lei Liu, Ninghao Liu, Xiaoming Zhai, Xuansheng Wu, Yun Wang.

**Figure 1.** Overview of the bias amplification loop and the BRIDGE framework. (a) Bias Propagation: Representation bias in training data creates a feedback loop that reinforces educational disparities. (b) ERM Vulnerability: Under ERM, the decision boundary for high scores is skewed by the majority non-ELL group, causing systematic underprediction of the sparse ELL high-score subgroup. (c) BRIDGE Framework: Our app…

Original abstract

In the field of educational assessment, automated scoring systems increasingly rely on deep learning and large language models (LLMs). However, these systems face significant risks of bias amplification, where model prediction gaps between student groups become larger than those observed in training data. This issue is especially severe for underrepresented groups such as English Language Learners (ELLs), as models may inherit and further magnify existing disparities in the data. We identify that this issue is closely tied to representation bias: the scarcity of minority (high-scoring ELL) samples makes models trained with empirical risk minimization favor majority (non-ELL) linguistic patterns. Consequently, models tend to under-predict ELL students who even demonstrate comparable domain knowledge but use different linguistic patterns, thereby undermining the fairness of automated scoring outcomes. To mitigate this, we propose BRIDGE, a Bias-Reducing Inter-group Data GEneration framework designed for low-resource assessment settings. Instead of relying on the limited minority samples, BRIDGE synthesizes high-scoring ELL samples by "pasting" construct-relevant (i.e., rubric-aligned knowledge and evidence) content from abundant high-scoring non-ELL samples into authentic ELL linguistic patterns. We further introduce a discriminator model to ensure the quality of synthetic samples. Experiments on California Science Test (CAST) datasets demonstrate that BRIDGE effectively reduces prediction bias for high-scoring ELL students while maintaining overall scoring performance. Notably, our method achieves fairness gains comparable to using additional real human data, offering a cost-effective solution for ensuring equitable scoring in large-scale assessments.
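The abstract describes bias amplification as the model's prediction gap between groups exceeding the gap already present in the data, and a passage quoted elsewhere on this page writes it as BiasAmp(f) = Δ_model − Δ_data. A small Python sketch of that difference follows. It assumes, purely for illustration, that each Δ is the mean score of non-ELL students minus the mean score of ELL students; the paper's exact gap definition and its FP/FN decomposition are not reproduced on this page, and the function names and toy scores are hypothetical.

```python
from statistics import mean
from typing import Sequence


def group_gap(scores: Sequence[float], is_ell: Sequence[bool]) -> float:
    """Mean non-ELL score minus mean ELL score (assumed gap definition)."""
    ell = [s for s, g in zip(scores, is_ell) if g]
    non_ell = [s for s, g in zip(scores, is_ell) if not g]
    if not ell or not non_ell:
        raise ValueError("both groups must be non-empty")
    return mean(non_ell) - mean(ell)


def bias_amplification(
    y_true: Sequence[float],
    y_pred: Sequence[float],
    is_ell: Sequence[bool],
) -> float:
    """Delta_model - Delta_data: positive when the model widens the gap."""
    delta_data = group_gap(y_true, is_ell)
    delta_model = group_gap(y_pred, is_ell)
    return delta_model - delta_data


# Toy scores, invented for this example: both groups have the same
# human scores, but the model underpredicts one high-scoring ELL response.
is_ell = [False, False, True, True]
y_true = [3, 2, 3, 2]
y_pred = [3, 2, 2, 2]

amp = bias_amplification(y_true, y_pred, is_ell)
print(amp)
```

In this toy case the human scores show no gap between groups, so the whole model gap counts as amplification.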

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BRIDGE pastes construct-relevant content from non-ELL responses into ELL responses and filters the results with a discriminator to cut scoring bias, but the separation assumption looks shaky and the abstract skips key validation details.

The main point is that BRIDGE generates synthetic high-scoring ELL samples by pasting rubric-aligned knowledge from abundant non-ELL responses into real ELL linguistic patterns, then runs them through a discriminator to keep quality up. On the CAST datasets it reports a bias reduction for high-scoring ELL students that matches what you get from adding more real human-scored data, all while holding overall scoring performance steady.

Referee Report

2 major / 1 minor

Summary. The paper proposes BRIDGE, a Bias-Reducing Inter-group Data GEneration framework that synthesizes high-scoring ELL samples by transferring construct-relevant content from abundant high-scoring non-ELL responses into authentic ELL linguistic patterns, augmented by a discriminator for quality control. Experiments on California Science Test (CAST) datasets claim that BRIDGE reduces prediction bias for high-scoring ELL students while preserving overall scoring performance, achieving fairness gains comparable to adding real human data.

Significance. If the synthetic samples preserve rubric alignment and the discriminator reliably filters artifacts, BRIDGE could offer a practical, low-cost approach to addressing representation bias in automated scoring systems for underrepresented student groups in large-scale educational assessments.

major comments (2)

[Method] The central methodological assumption—that construct-relevant content can be cleanly extracted from non-ELL responses and pasted into ELL linguistic frames without introducing new biases or validity artifacts—is load-bearing for the fairness claim but receives no direct empirical test or ablation (e.g., no comparison of rubric scores or discriminator failure modes on entangled content-language cases).
[Experiments] The abstract asserts that fairness gains are comparable to real human data, yet no concrete bias metrics (e.g., group-wise prediction gaps, demographic parity), baseline models, statistical tests, data-split details, or exclusion criteria are reported, rendering the experimental support for the central claim unverifiable.

minor comments (1)

[Abstract] The term 'pasting' is used informally in the abstract and method overview; a precise algorithmic description or pseudocode would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive review. We have carefully considered the comments and provide point-by-point responses below. We believe the suggested revisions will improve the clarity and rigor of the paper.

Referee: [Method] The central methodological assumption—that construct-relevant content can be cleanly extracted from non-ELL responses and pasted into ELL linguistic frames without introducing new biases or validity artifacts—is load-bearing for the fairness claim but receives no direct empirical test or ablation (e.g., no comparison of rubric scores or discriminator failure modes on entangled content-language cases).

Authors: We acknowledge that the assumption regarding clean extraction and pasting of construct-relevant content is central to our approach and would benefit from more direct empirical scrutiny. While the original manuscript demonstrates the effectiveness through end-to-end performance metrics and the use of a quality discriminator, it does not include a dedicated ablation study on entangled content-language scenarios or direct rubric score comparisons for synthetic samples. In the revised version, we will incorporate an additional ablation experiment that evaluates rubric alignment of synthetic samples via expert annotation and analyzes cases where the discriminator may fail on highly entangled inputs. This will provide stronger evidence for the validity of the generated samples.
revision: yes

Referee: [Experiments] The abstract asserts that fairness gains are comparable to real human data, yet no concrete bias metrics (e.g., group-wise prediction gaps, demographic parity), baseline models, statistical tests, data-split details, or exclusion criteria are reported, rendering the experimental support for the central claim unverifiable.

Authors: We agree with the referee that the experimental section would benefit from more explicit reporting of concrete bias metrics, baselines, and setup details to support the claim of comparability to real human data. In the revised manuscript, we will add specific quantitative results, such as the exact group-wise prediction gaps before and after applying BRIDGE, comparisons to baselines, statistical significance tests, detailed data-split information, and exclusion criteria. We will also update the abstract to reference these key findings.
revision: yes
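One way the requested reporting could look is a percentile-bootstrap confidence interval on the mean signed error (model score minus human score) within the high-scoring ELL subgroup. The Python sketch below is an illustration under assumptions, not the authors' evaluation protocol: the choice of statistic, the `bootstrap_ci` helper, its defaults, and the toy error values are all hypothetical.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_ci(
    errors: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean of `errors`."""
    if not errors:
        raise ValueError("errors must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(errors)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(errors, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Toy signed errors for a high-scoring ELL subgroup, invented for this
# example; negative values mean the model scored below the human rater.
errors = [-1, -1, 0, -1, 0, -1, -1, 0]

point = mean(errors)
lo, hi = bootstrap_ci(errors)
print(f"mean signed error {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Computing the same interval before and after augmentation, on the same held-out responses, would show whether a reported reduction in underprediction is distinguishable from resampling noise.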

Circularity Check

0 steps flagged

No significant circularity in empirical data augmentation framework

The paper presents BRIDGE as an empirical data-generation method that synthesizes samples by pasting construct-relevant content from non-ELL responses into ELL linguistic frames, then applies a discriminator for quality control. All claims are supported by experiments on external California Science Test (CAST) datasets, with fairness metrics compared directly to real human data baselines. No equations, derivations, or self-referential quantities appear in the provided text; the method does not reduce any prediction or result to a fitted parameter or self-citation by construction. The framework is externally validated rather than internally self-consistent by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that rubric-aligned knowledge content is separable from linguistic style and can be transferred without loss of validity or introduction of artifacts; no explicit free parameters or invented physical entities are stated in the abstract.

axioms (1)

domain assumption: Construct-relevant content from non-ELL responses can be isolated and pasted into ELL linguistic patterns while preserving scoring validity.
This separability is the core mechanism of the proposed data generation step.

invented entities (1)

BRIDGE data generation framework (no independent evidence)
purpose: To synthesize high-scoring ELL samples for bias reduction
Newly introduced method whose effectiveness is claimed via experiments but lacks independent external validation beyond the paper.

pith-pipeline@v0.9.0 · 5597 in / 1387 out tokens · 20146 ms · 2026-05-15T19:26:18.004556+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: BRIDGE synthesizes high-scoring ELL samples by 'pasting' construct-relevant content from abundant high-scoring non-ELL samples into authentic ELL linguistic patterns

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: BiasAmp(f) = Δ_model − Δ_data with decomposition into FP/FN terms

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

