Exploring Data Augmentation and Resampling Strategies for Transformer-Based Models to Address Class Imbalance in AI Scoring of Scientific Explanations in NGSS Classroom
Pith reviewed 2026-05-15 07:32 UTC · model grok-4.3
The pith
Data augmentation with ALP and EASE lets transformer models reach perfect precision and recall on the rarest categories when scoring student scientific explanations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On a set of 1,466 high school student responses scored on 11 binary analytic categories, three augmentation strategies (GPT-4 synthetic responses, EASE word-level extraction and filtering, and ALP phrase-level extraction) substantially lift SciBERT performance over plain fine-tuning. ALP reaches perfect precision, recall, and F1 on the most severely imbalanced categories (5, 6, 7, and 9), and EASE raises alignment with human scores across both the scientific-idea categories (1–6) and the inaccurate-idea categories (7–11). The authors compare these strategies with SMOTE oversampling, aiming to avoid overfitting and to retain novice-level data.
What carries the argument
ALP phrase-level extraction and EASE word-level filtering, which generate targeted synthetic examples to balance the 11-category rubric while preserving conceptual content from real student responses.
If this is right
- Transformer models can classify the rarest advanced-reasoning categories with perfect precision, recall, and F1 after targeted augmentation.
- Augmentation retains novice-level data better than SMOTE-style oversampling and therefore supports learning-progression alignment.
- Automated scoring becomes practical for NGSS assessments that contain both complete and incomplete idea categories.
- The same strategies can be applied to other imbalanced educational rubrics without retraining from scratch.
Where Pith is reading between the lines
- These methods could reduce the volume of human-labeled data needed to train reliable educational scorers.
- Applying the augmentation pipeline to longitudinal student responses would test whether the scores track actual changes in reasoning over time.
- The gains may transfer to other transformer architectures or science topics with similar imbalance patterns.
Load-bearing premise
The synthetic responses and extracted phrases from GPT-4 and the extraction methods accurately reflect the distribution and conceptual content of actual student answers without introducing new biases.
What would settle it
A fresh collection of real student responses scored by humans in which the augmented model's predictions on the rare categories diverge from the human scores.
Original abstract
Automated scoring of students' scientific explanations offers the potential for immediate, accurate feedback, yet class imbalance in rubric categories, particularly those capturing advanced reasoning, remains a challenge. This study investigates augmentation strategies to improve transformer-based text classification of student responses to a physical science assessment based on an NGSS-aligned learning progression. The dataset consists of 1,466 high school responses scored on 11 binary-coded analytic categories. This rubric identifies six important components including scientific ideas needed for a complete explanation along with five common incomplete or inaccurate ideas. Using SciBERT as a baseline, we applied fine-tuning and tested these augmentation strategies: (1) GPT-4-generated synthetic responses, (2) EASE, a word-level extraction and filtering approach, and (3) ALP (Augmentation using Lexicalized Probabilistic context-free grammar) phrase-level extraction. While fine-tuning SciBERT improved recall over baseline, augmentation substantially enhanced performance, with GPT data boosting both precision and recall, and ALP achieving perfect precision, recall, and F1 scores across the most severely imbalanced categories (5, 6, 7, and 9). Across all rubric categories, EASE augmentation substantially increased alignment with human scoring for both scientific ideas (Categories 1–6) and inaccurate ideas (Categories 7–11). We compared the different augmentation strategies to a traditional oversampling method (SMOTE) in an effort to avoid overfitting and retain novice-level data critical for learning progression alignment. Findings demonstrate that targeted augmentation can address severe imbalance while preserving conceptual coverage, offering a scalable solution for automated learning progression-aligned scoring in science education.
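The abstract reports precision, recall, and F1 separately for each of the 11 binary categories. A minimal sketch of that per-category computation follows, assuming labels and predictions are stored as one 0/1 list per response. The function name and the toy data are hypothetical and are not taken from the paper. Reporting the positive support next to each score helps show how few test examples a perfect score on a rare category may rest on.

```python
from typing import Dict, List, Sequence


def per_category_prf(
    y_true: Sequence[Sequence[int]],
    y_pred: Sequence[Sequence[int]],
) -> List[Dict[str, float]]:
    """Precision, recall, F1, and positive support for each binary category."""
    n_categories = len(y_true[0])
    results = []
    for c in range(n_categories):
        pairs = [(t[c], p[c]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append(
            {"precision": precision, "recall": recall, "f1": f1, "support": tp + fn}
        )
    return results


# Hypothetical toy data: 4 responses scored on 3 categories.
toy_true = [[1, 0, 0], [1, 0, 1], [0, 0, 0], [1, 1, 0]]
toy_pred = [[1, 0, 0], [1, 0, 1], [1, 0, 0], [0, 1, 0]]
toy_scores = per_category_prf(toy_true, toy_pred)
```

In the toy data the second category scores a perfect F1, but on a single positive example, which is the kind of detail a support column makes visible.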
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that data augmentation strategies—GPT-4-generated synthetic responses, EASE word-level extraction and filtering, and ALP phrase-level extraction using lexicalized probabilistic context-free grammars—applied to fine-tuned SciBERT substantially improve classification performance on an imbalanced 11-category rubric for 1,466 high-school student scientific explanations. It reports that ALP yields perfect precision, recall, and F1 on the most severely imbalanced categories (5, 6, 7, 9), that EASE increases alignment with human scoring across scientific-idea and inaccurate-idea categories, and that these methods outperform SMOTE while preserving novice-level conceptual coverage.
Significance. If the synthetic data faithfully reproduces the lexical and conceptual distributions of real student responses, the work supplies a concrete, scalable route to learning-progression-aligned automated scoring in science education. The explicit comparison against SMOTE and the emphasis on retaining novice-level data are positive features that distinguish the contribution from generic oversampling studies.
major comments (2)
- [Results (ALP evaluation)] The headline result that ALP produces perfect P/R/F1 on categories 5, 6, 7 and 9 rests on the unverified assumption that GPT-4 generations and phrase-level extractions match the distribution of the original 1,466 real responses. No human validation of synthetic fidelity, no n-gram or embedding-overlap statistics, and no explicit statement that the test split remained strictly real-only are provided; without these, the perfect scores may reflect generation-style leakage rather than improved capture of learning-progression concepts.
- [Results] The claim that augmentation 'substantially increased alignment with human scoring' for categories 1–11 is presented without reported statistical significance tests, confidence intervals, or per-category confusion matrices that would allow readers to judge whether the gains exceed what would be expected from random variation on the small real test set.
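The first major comment asks for n-gram overlap statistics and a guarantee that the test split is real-only. The sketch below shows one simple way such checks could look, assuming whitespace tokenization and exact-match comparison after lowercasing. The function names and example sentences are hypothetical, and the paper may use different measures.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 2) -> Set[Tuple[str, ...]]:
    """Distinct word n-grams of a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_coverage(synthetic: List[str], real: List[str], n: int = 2) -> float:
    """Fraction of distinct synthetic n-grams that also occur in real responses."""
    real_grams = set().union(*(ngrams(r, n) for r in real))
    synth_grams = set().union(*(ngrams(s, n) for s in synthetic))
    if not synth_grams:
        return 0.0
    return len(synth_grams & real_grams) / len(synth_grams)


def has_synthetic_leak(test_texts: List[str], synthetic_texts: List[str]) -> bool:
    """True if any test response exactly matches a synthetic one after normalization."""
    normalized_synth = {" ".join(s.lower().split()) for s in synthetic_texts}
    return any(" ".join(t.lower().split()) in normalized_synth for t in test_texts)


# Hypothetical example sentences, not drawn from the dataset.
real_responses = ["the charges repel each other", "like charges repel"]
synthetic_responses = ["the charges repel strongly", "opposite charges attract"]
coverage = ngram_coverage(synthetic_responses, real_responses, n=2)
leak = has_synthetic_leak(real_responses, synthetic_responses)
```

A very high coverage value would suggest the synthetic text closely mirrors real phrasing, while a very low one would suggest a distinct generation style. Neither outcome alone settles whether the classifier learned the target concepts.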
minor comments (2)
- [Methods] The exact procedure for constructing the ALP grammar and the filtering thresholds used in EASE should be stated with sufficient detail (hyper-parameters, vocabulary size, generation temperature) to permit replication.
- [Experimental Setup] Clarify whether the 1,466 responses were split at the student or response level and whether any response-level leakage occurred when synthetic data were added to the training set.
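The second minor comment turns on whether the split was made per student or per response. As an illustration only, a student-level split could be sketched as follows, assuming each record carries a `student_id` field. That field name, the function, and the toy records are hypothetical, since the source does not describe the data schema.

```python
import random
from typing import Dict, List, Tuple


def split_by_student(
    records: List[Dict], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List[Dict], List[Dict]]:
    """Split so that all responses from one student land on the same side."""
    students = sorted({r["student_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(students)
    n_test = max(1, round(len(students) * test_fraction))
    test_students = set(students[:n_test])
    train = [r for r in records if r["student_id"] not in test_students]
    test = [r for r in records if r["student_id"] in test_students]
    return train, test


# Hypothetical toy records: 5 students with 2 responses each.
toy_records = [
    {"student_id": s, "text": f"response {k} from student {s}"}
    for s in range(5)
    for k in range(2)
]
train_split, test_split = split_by_student(toy_records, test_fraction=0.2, seed=0)
```

Under this design, any synthetic examples would be generated only from `train_split`, so nothing derived from a test student's wording reaches training.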
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and will make the indicated revisions to improve clarity and rigor.
Point-by-point responses
- Referee: [Results (ALP evaluation)] The headline result that ALP produces perfect P/R/F1 on categories 5, 6, 7 and 9 rests on the unverified assumption that GPT-4 generations and phrase-level extractions match the distribution of the original 1,466 real responses. No human validation of synthetic fidelity, no n-gram or embedding-overlap statistics, and no explicit statement that the test split remained strictly real-only are provided; without these, the perfect scores may reflect generation-style leakage rather than improved capture of learning-progression concepts.
  Authors: We agree that the manuscript should explicitly address data splits and provide quantitative evidence of synthetic data fidelity. In the revision we will state that the test set consists exclusively of real student responses (synthetic data is used only for training). We will also add n-gram overlap statistics and embedding cosine similarity scores comparing synthetic and real responses to demonstrate distributional alignment. These additions will clarify that the reported perfect scores on categories 5, 6, 7, and 9 reflect improved learning of the target concepts rather than leakage.
  Revision: yes
- Referee: [Results] The claim that augmentation 'substantially increased alignment with human scoring' for categories 1–11 is presented without reported statistical significance tests, confidence intervals, or per-category confusion matrices that would allow readers to judge whether the gains exceed what would be expected from random variation on the small real test set.
  Authors: We acknowledge that the original submission omitted statistical significance testing, confidence intervals, and confusion matrices. In the revised manuscript we will add McNemar’s tests (or appropriate paired tests) for the performance differences, report 95% confidence intervals for all precision/recall/F1 values, and include per-category confusion matrices for the baseline SciBERT and each augmentation condition. These changes will allow readers to assess whether the observed gains exceed random variation on the held-out real test set.
  Revision: yes
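The rebuttal proposes McNemar's tests for paired comparisons between the baseline and each augmentation condition. A minimal sketch of the exact (binomial) form of that test for one category follows, assuming both models are evaluated on the same test responses. The function name and toy predictions are hypothetical, and the authors may choose a different variant.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(
    y_true: Sequence[int], pred_a: Sequence[int], pred_b: Sequence[int]
) -> float:
    """Two-sided exact McNemar p-value for two classifiers on the same items."""
    # b: model A correct and model B wrong; c: model A wrong and model B correct.
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree in correctness
    k = min(b, c)
    # Under the null, each discordant pair favors either model with probability 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy category: 10 test responses, 5 of them positive.
toy_labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
baseline_pred = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]   # misses four positives
augmented_pred = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]  # perfect on this toy set
p_value = mcnemar_exact(toy_labels, baseline_pred, augmented_pred)
```

In this toy case the augmented model fixes all four baseline errors and introduces none, yet the exact p-value is 0.125. That illustrates the referee's concern that gains on a rare category can be hard to distinguish from chance when the test set holds few positives.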
Circularity Check
No circularity: empirical augmentation comparisons are self-contained
The paper reports direct empirical results from fine-tuning SciBERT on real student responses augmented via GPT-4 generations, EASE word extraction, and ALP phrase extraction, then evaluating precision/recall/F1 on held-out real test data against SMOTE baselines. No equations, parameters, or claims reduce by construction to the inputs; performance metrics are computed independently from the augmentation process. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the central results. The derivation chain consists solely of standard ML training and evaluation steps whose outputs are not definitionally equivalent to the augmentation choices.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Fine-tuning pre-trained transformer models like SciBERT improves performance on text classification tasks.
- Ad hoc to paper: Synthetic data generated by GPT-4 can accurately represent student responses in scientific explanations.