BRIDGE the Gap: Mitigating Bias Amplification in Automated Scoring of English Language Learners via Inter-group Data Augmentation
Pith reviewed 2026-05-15 19:26 UTC · model grok-4.3
The pith
A data synthesis method reduces bias against high-scoring English Language Learners in automated test scoring.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BRIDGE synthesizes high-scoring ELL samples by pasting rubric-aligned knowledge and evidence from abundant high-scoring non-ELL samples into authentic ELL linguistic patterns; a discriminator model then screens the synthetic samples for quality. This mitigates representation bias in automated scoring systems trained on imbalanced data.
What carries the argument
BRIDGE, a Bias-Reducing Inter-group Data GEneration framework that pastes construct-relevant content from non-ELL responses into ELL linguistic patterns, with a discriminator filtering the synthetic output.
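As a rough illustration of the data flow just described (extract content, paste it into an ELL pattern, filter with a discriminator), a minimal Python sketch might look like the following. The names `extract_content`, `paste_into_style`, and `discriminator`, the acceptance `threshold`, and the random pairing of content with ELL style references are all assumptions made for this sketch; none of them come from the paper, and the toy stand-ins at the bottom exist only to make the sketch runnable.

```python
import random
from typing import Callable, List


def synthesize_ell_samples(
    non_ell_high: List[str],
    ell_style_refs: List[str],
    extract_content: Callable[[str], str],
    paste_into_style: Callable[[str, str], str],
    discriminator: Callable[[str], float],
    threshold: float = 0.5,
    seed: int = 0,
) -> List[str]:
    """Return synthetic samples that pass the discriminator.

    All three callables are hypothetical placeholders: in practice they
    would be models or rubric-based tools, which this sketch does not specify.
    """
    rng = random.Random(seed)
    kept = []
    for response in non_ell_high:
        # Step 1: pull out the rubric-aligned knowledge and evidence.
        content = extract_content(response)
        # Step 2: pick an authentic ELL response to supply the linguistic pattern.
        style_ref = rng.choice(ell_style_refs)
        candidate = paste_into_style(content, style_ref)
        # Step 3: keep the candidate only if the discriminator accepts it.
        if discriminator(candidate) >= threshold:
            kept.append(candidate)
    return kept


# Toy stand-ins (assumptions for illustration, not real models).
toy_extract = lambda text: text.lower()
toy_paste = lambda content, style: f"{style} {content}"
toy_discriminator = lambda text: 1.0 if "because" in text else 0.0

synthetic = synthesize_ell_samples(
    non_ell_high=["Plants grow because of sunlight.", "Water is wet."],
    ell_style_refs=["I think"],
    extract_content=toy_extract,
    paste_into_style=toy_paste,
    discriminator=toy_discriminator,
)
```

Passing the three stages in as callables keeps the sketch agnostic about how each stage is implemented, which matters here because the page does not say how extraction or pasting is actually done.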
If this is right
- Prediction bias for high-scoring ELL students is reduced on CAST datasets.
- Overall scoring performance is maintained.
- Fairness gains are comparable to those from using additional real human data.
- The method offers a cost-effective alternative for equitable scoring in large-scale assessments.
Where Pith is reading between the lines
- Similar inter-group augmentation could apply to other imbalanced domains like medical diagnosis or hiring algorithms where minority groups use different expression styles.
- Future work might test if the discriminator can be replaced with human evaluation or more advanced quality checks.
- The separation of content and language assumes rubrics capture construct-relevant parts cleanly, which may need validation per subject.
Load-bearing premise
Construct-relevant content can be cleanly separated from linguistic patterns and pasted into ELL responses without introducing new artifacts or biases that the discriminator misses.
What would settle it
A human evaluation study in which raters judge whether synthetic ELL samples contain unnatural artifacts or altered meaning compared to real ones, or a replication showing no reduction in prediction bias when BRIDGE is applied.
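One way such a rater study could be analyzed is a permutation test on artifact-flag rates. The sketch below is only an illustration under assumptions: each sample receives a binary flag (1 if a rater marked it as containing an artifact), the flag lists are made up, and the function name `permutation_test` is hypothetical.

```python
import random
from statistics import mean


def permutation_test(real_flags, synth_flags, n_perm=5000, seed=0):
    """Two-sided permutation test on the difference in artifact-flag rates.

    Returns (observed_difference, p_value), where the difference is
    synthetic rate minus real rate.
    """
    rng = random.Random(seed)
    observed = mean(synth_flags) - mean(real_flags)
    pooled = list(real_flags) + list(synth_flags)
    n_real = len(real_flags)
    extreme = 0
    for _ in range(n_perm):
        # Shuffle group membership and recompute the rate difference.
        rng.shuffle(pooled)
        diff = mean(pooled[n_real:]) - mean(pooled[:n_real])
        if abs(diff) >= abs(observed) - 1e-12:
            extreme += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (extreme + 1) / (n_perm + 1)


# Made-up flags for illustration only.
real_flags = [0, 0, 1, 0, 0, 0, 1, 0]
synth_flags = [1, 0, 1, 1, 0, 1, 0, 1]
observed_diff, p_value = permutation_test(real_flags, synth_flags)
```

A small p-value would suggest raters flag synthetic samples at a different rate than real ones; a large one would not prove the samples are artifact-free, only that this test did not detect a difference.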
Original abstract
In the field of educational assessment, automated scoring systems increasingly rely on deep learning and large language models (LLMs). However, these systems face significant risks of bias amplification, where model prediction gaps between student groups become larger than those observed in training data. This issue is especially severe for underrepresented groups such as English Language Learners (ELLs), as models may inherit and further magnify existing disparities in the data. We identify that this issue is closely tied to representation bias: the scarcity of minority (high-scoring ELL) samples makes models trained with empirical risk minimization favor majority (non-ELL) linguistic patterns. Consequently, models tend to under-predict ELL students who even demonstrate comparable domain knowledge but use different linguistic patterns, thereby undermining the fairness of automated scoring outcomes. To mitigate this, we propose BRIDGE, a Bias-Reducing Inter-group Data GEneration framework designed for low-resource assessment settings. Instead of relying on the limited minority samples, BRIDGE synthesizes high-scoring ELL samples by "pasting" construct-relevant (i.e., rubric-aligned knowledge and evidence) content from abundant high-scoring non-ELL samples into authentic ELL linguistic patterns. We further introduce a discriminator model to ensure the quality of synthetic samples. Experiments on California Science Test (CAST) datasets demonstrate that BRIDGE effectively reduces prediction bias for high-scoring ELL students while maintaining overall scoring performance. Notably, our method achieves fairness gains comparable to using additional real human data, offering a cost-effective solution for ensuring equitable scoring in large-scale assessments.
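The abstract defines bias amplification as model prediction gaps between groups exceeding the gaps present in the training data, and the page later quotes the paper's expression BiasAmp(f) = Δ_model − Δ_data. A minimal Python sketch of that quantity is given below. It assumes that each Δ is the difference in mean score between the non-ELL and ELL groups; the paper's exact definition (including its FP/FN decomposition) is not shown on this page, and the toy scores are made up.

```python
from statistics import mean


def group_gap(scores, groups, majority="non-ELL", minority="ELL"):
    """Mean score of the majority group minus mean score of the minority group."""
    maj = [s for s, g in zip(scores, groups) if g == majority]
    mino = [s for s, g in zip(scores, groups) if g == minority]
    if not maj or not mino:
        raise ValueError("both groups need at least one sample")
    return mean(maj) - mean(mino)


def bias_amplification(human_scores, model_scores, groups):
    """BiasAmp = delta_model - delta_data (assumed mean-gap definition).

    A positive value means the model's gap between groups is wider than
    the gap already present in the human-assigned scores.
    """
    delta_data = group_gap(human_scores, groups)
    delta_model = group_gap(model_scores, groups)
    return delta_model - delta_data


# Made-up toy data: the model under-predicts two of the three ELL responses.
groups = ["non-ELL", "non-ELL", "non-ELL", "ELL", "ELL", "ELL"]
human = [2, 3, 3, 2, 3, 1]
model = [2, 3, 3, 1, 2, 1]
amp = bias_amplification(human, model, groups)
```

In this toy case the human-score gap is 2/3 of a point and the model-score gap is 4/3, so the amplification is 2/3: the model doubles a disparity that already exists in the data.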
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes BRIDGE, a Bias-Reducing Inter-group Data GEneration framework that synthesizes high-scoring ELL samples by transferring construct-relevant content from abundant high-scoring non-ELL responses into authentic ELL linguistic patterns, augmented by a discriminator for quality control. Experiments on California Science Test (CAST) datasets claim that BRIDGE reduces prediction bias for high-scoring ELL students while preserving overall scoring performance, achieving fairness gains comparable to adding real human data.
Significance. If the synthetic samples preserve rubric alignment and the discriminator reliably filters artifacts, BRIDGE could offer a practical, low-cost approach to addressing representation bias in automated scoring systems for underrepresented student groups in large-scale educational assessments.
major comments (2)
- [Method] The central methodological assumption—that construct-relevant content can be cleanly extracted from non-ELL responses and pasted into ELL linguistic frames without introducing new biases or validity artifacts—is load-bearing for the fairness claim but receives no direct empirical test or ablation (e.g., no comparison of rubric scores or discriminator failure modes on entangled content-language cases).
- [Experiments] The abstract asserts that fairness gains are comparable to real human data, yet no concrete bias metrics (e.g., group-wise prediction gaps, demographic parity), baseline models, statistical tests, data-split details, or exclusion criteria are reported, rendering the experimental support for the central claim unverifiable.
minor comments (1)
- [Abstract] The term 'pasting' is used informally in the abstract and method overview; a precise algorithmic description or pseudocode would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive review. We have carefully considered the comments and provide point-by-point responses below. We believe the suggested revisions will improve the clarity and rigor of the paper.
- Referee: [Method] The central methodological assumption (that construct-relevant content can be cleanly extracted from non-ELL responses and pasted into ELL linguistic frames without introducing new biases or validity artifacts) is load-bearing for the fairness claim but receives no direct empirical test or ablation (e.g., no comparison of rubric scores or discriminator failure modes on entangled content-language cases).
  Authors: We acknowledge that the assumption regarding clean extraction and pasting of construct-relevant content is central to our approach and would benefit from more direct empirical scrutiny. While the original manuscript demonstrates the effectiveness through end-to-end performance metrics and the use of a quality discriminator, it does not include a dedicated ablation study on entangled content-language scenarios or direct rubric score comparisons for synthetic samples. In the revised version, we will incorporate an additional ablation experiment that evaluates rubric alignment of synthetic samples via expert annotation and analyzes cases where the discriminator may fail on highly entangled inputs. This will provide stronger evidence for the validity of the generated samples.
  Revision: yes
- Referee: [Experiments] The abstract asserts that fairness gains are comparable to real human data, yet no concrete bias metrics (e.g., group-wise prediction gaps, demographic parity), baseline models, statistical tests, data-split details, or exclusion criteria are reported, rendering the experimental support for the central claim unverifiable.
  Authors: We agree with the referee that the experimental section would benefit from more explicit reporting of concrete bias metrics, baselines, and setup details to support the claim of comparability to real human data. In the revised manuscript, we will add specific quantitative results, such as the exact group-wise prediction gaps before and after applying BRIDGE, comparisons to baselines, statistical significance tests, detailed data-split information, and exclusion criteria. We will also update the abstract to reference these key findings.
  Revision: yes
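To make the requested reporting concrete, one possible form of a group-wise prediction gap with an uncertainty estimate is sketched below in Python. It is an assumption-laden illustration, not the paper's metric: the gap is defined here as the mean signed error (prediction minus human score) of the non-ELL group minus that of the ELL group, the interval is a stratified percentile bootstrap, and the records are made up.

```python
import random
from statistics import mean


def mean_signed_error(records):
    """Average of (predicted - human) over (group, human, predicted) records."""
    return mean(pred - human for _, human, pred in records)


def prediction_gap(records):
    """Non-ELL mean signed error minus ELL mean signed error.

    A positive value means ELL responses are under-predicted relative
    to non-ELL responses.
    """
    ell = [r for r in records if r[0] == "ELL"]
    non_ell = [r for r in records if r[0] == "non-ELL"]
    return mean_signed_error(non_ell) - mean_signed_error(ell)


def bootstrap_ci(records, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the gap, resampling within each group."""
    rng = random.Random(seed)
    ell = [r for r in records if r[0] == "ELL"]
    non_ell = [r for r in records if r[0] == "non-ELL"]
    gaps = []
    for _ in range(n_boot):
        # Stratified resampling keeps both groups non-empty in every draw.
        sample = rng.choices(ell, k=len(ell)) + rng.choices(non_ell, k=len(non_ell))
        gaps.append(prediction_gap(sample))
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_boot)]
    upper = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Made-up records: (group, human_score, predicted_score).
records = [
    ("ELL", 3, 2), ("ELL", 3, 3), ("ELL", 2, 1),
    ("non-ELL", 3, 3), ("non-ELL", 2, 2), ("non-ELL", 3, 3),
]
gap = prediction_gap(records)
ci_low, ci_high = bootstrap_ci(records)
```

Reporting this gap before and after augmentation, with intervals, would let a reader judge whether a reduction is larger than resampling noise. Restricting `records` to high-scoring responses would target the subgroup the paper focuses on.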
Circularity Check
No significant circularity in empirical data augmentation framework
The paper presents BRIDGE as an empirical data-generation method that synthesizes samples by pasting construct-relevant content from non-ELL responses into ELL linguistic frames, then applies a discriminator for quality control. All claims are supported by experiments on external California Science Test (CAST) datasets, with fairness metrics compared directly to real human data baselines. No equations, derivations, or self-referential quantities appear in the provided text; the method does not reduce any prediction or result to a fitted parameter or self-citation by construction. The framework is externally validated rather than internally self-consistent by definition.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Construct-relevant content from non-ELL responses can be isolated and pasted into ELL linguistic patterns while preserving scoring validity.
invented entities (1)
- BRIDGE data generation framework (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: BRIDGE synthesizes high-scoring ELL samples by 'pasting' construct-relevant content from abundant high-scoring non-ELL samples into authentic ELL linguistic patterns
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: BiasAmp(f) = Δ_model − Δ_data with decomposition into FP/FN terms
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.