
arxiv: 2604.19754 · v1 · submitted 2026-03-21 · 💻 cs.AI · cs.LG

Exploring Data Augmentation and Resampling Strategies for Transformer-Based Models to Address Class Imbalance in AI Scoring of Scientific Explanations in NGSS Classroom

Pith reviewed 2026-05-15 07:32 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords data augmentation · class imbalance · automated scoring · scientific explanations · transformer models · NGSS · learning progression · SciBERT

The pith

Data augmentation with ALP and EASE lets transformer models reach perfect precision and recall on the rarest categories when scoring student scientific explanations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests augmentation strategies to fix severe class imbalance when training AI to score high school students' written science explanations on an 11-category NGSS rubric. The authors start with SciBERT fine-tuning and then add GPT-4 synthetic responses, word-level EASE extraction, or phrase-level ALP extraction. These steps raise precision and recall across both scientific-idea and inaccurate-idea categories, with ALP hitting perfect scores on the most imbalanced ones and EASE improving overall human alignment better than standard oversampling. The approach matters because it could deliver immediate, learning-progression-aligned feedback in classrooms without losing coverage of novice ideas.

Core claim

On a set of 1,466 student responses scored on 11 binary analytic categories, augmentation with GPT-4 synthetic data, EASE word extraction, and ALP phrase extraction substantially lifts SciBERT performance over baseline fine-tuning; ALP reaches perfect precision, recall, and F1 on the most severely imbalanced categories while EASE raises alignment with human scores for every scientific and inaccurate idea category, all while avoiding the overfitting that simple oversampling produces.
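To make the reported metrics concrete, here is a minimal sketch of how precision, recall, and F1 can be computed separately for each binary rubric category. It is not the authors' evaluation code: the function name `per_category_prf` and the toy labels are assumptions for illustration only.

```python
from typing import Dict, List


def per_category_prf(y_true: List[List[int]],
                     y_pred: List[List[int]]) -> List[Dict[str, float]]:
    """Precision, recall, F1, and support for each binary category (column)."""
    n_categories = len(y_true[0])
    results = []
    for c in range(n_categories):
        pairs = [(t[c], p[c]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        # Undefined ratios (no predicted or no true positives) are reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append({"precision": precision, "recall": recall,
                        "f1": f1, "support": tp + fn})
    return results


if __name__ == "__main__":
    # Hypothetical toy data: 4 responses, 2 rubric categories.
    y_true = [[1, 0], [1, 0], [0, 1], [0, 0]]
    y_pred = [[1, 0], [0, 0], [0, 1], [0, 1]]
    for idx, row in enumerate(per_category_prf(y_true, y_pred), start=1):
        print(f"category {idx}: {row}")
```

Reporting `support` next to each score matters here: a perfect F1 on a category with only a handful of positive test responses is much weaker evidence than the same score on a well-populated category.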

What carries the argument

ALP phrase-level extraction and EASE word-level filtering, which generate targeted synthetic examples to balance the 11-category rubric while preserving conceptual content from real student responses.

If this is right

  • Transformer models can reach perfect precision, recall, and F1 on the most severely imbalanced categories (5, 6, 7, and 9) after targeted augmentation.
  • Augmentation retains novice-level data better than SMOTE-style oversampling and therefore supports learning-progression alignment.
  • Automated scoring becomes practical for NGSS assessments that contain both complete and incomplete idea categories.
  • The same strategies can be applied to other imbalanced educational rubrics without retraining from scratch.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These methods could reduce the volume of human-labeled data needed to train reliable educational scorers.
  • Applying the augmentation pipeline to longitudinal student responses would test whether the scores track actual changes in reasoning over time.
  • The gains may transfer to other transformer architectures or science topics with similar imbalance patterns.

Load-bearing premise

The GPT-4 synthetic responses and the words and phrases extracted by EASE and ALP accurately reflect the distribution and conceptual content of actual student answers without introducing new biases.

What would settle it

A fresh collection of real student responses scored by humans in which the augmented model's predictions on the rare categories diverge from the human scores.

Figures

Figures reproduced from arXiv: 2604.19754 by Clare G.C. Franovic, Kevin Haudek, Leonora Kaldaras, Prudence Djagba.

Figure 1: Cart item. Question: When the wedges are removed, the cars will move. Predict which direction they will move and when they will stop. Justify your prediction. Level descriptions: (3) Student models and explanations represent causal relationships that integrate ideas of energy and Coulombic interactions at the atomic-molecular level to explain phenomena. (2) Student models and explanations represent causal…
Figure 2: Taxonomy and grouping for different data augmentation methods [11].
Original abstract

Automated scoring of students' scientific explanations offers the potential for immediate, accurate feedback, yet class imbalance in rubric categories, particularly those capturing advanced reasoning, remains a challenge. This study investigates augmentation strategies to improve transformer-based text classification of student responses to a physical science assessment based on an NGSS-aligned learning progression. The dataset consists of 1,466 high school responses scored on 11 binary-coded analytic categories. This rubric identifies six important components, including scientific ideas needed for a complete explanation, along with five common incomplete or inaccurate ideas. Using SciBERT as a baseline, we applied fine-tuning and tested these augmentation strategies: (1) GPT-4-generated synthetic responses, (2) EASE, a word-level extraction and filtering approach, and (3) ALP (Augmentation using Lexicalized Probabilistic context-free grammar) phrase-level extraction. While fine-tuning SciBERT improved recall over baseline, augmentation substantially enhanced performance, with GPT data boosting both precision and recall, and ALP achieving perfect precision, recall, and F1 scores across the most severely imbalanced categories (5, 6, 7, and 9). Across all rubric categories, EASE augmentation substantially increased alignment with human scoring for both scientific ideas (Categories 1–6) and inaccurate ideas (Categories 7–11). We compared the augmentation strategies to a traditional oversampling method (SMOTE) in an effort to avoid overfitting and retain novice-level data critical for learning progression alignment. Findings demonstrate that targeted augmentation can address severe imbalance while preserving conceptual coverage, offering a scalable solution for automated learning progression-aligned scoring in science education.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that data augmentation strategies—GPT-4-generated synthetic responses, EASE word-level extraction and filtering, and ALP phrase-level extraction using lexicalized probabilistic context-free grammars—applied to fine-tuned SciBERT substantially improve classification performance on an imbalanced 11-category rubric for 1,466 high-school student scientific explanations. It reports that ALP yields perfect precision, recall, and F1 on the most severely imbalanced categories (5, 6, 7, 9), that EASE increases alignment with human scoring across scientific-idea and inaccurate-idea categories, and that these methods outperform SMOTE while preserving novice-level conceptual coverage.

Significance. If the synthetic data faithfully reproduces the lexical and conceptual distributions of real student responses, the work supplies a concrete, scalable route to learning-progression-aligned automated scoring in science education. The explicit comparison against SMOTE and the emphasis on retaining novice-level data are positive features that distinguish the contribution from generic oversampling studies.

major comments (2)
  1. [Results (ALP evaluation)] The headline result that ALP produces perfect P/R/F1 on categories 5, 6, 7 and 9 rests on the unverified assumption that GPT-4 generations and phrase-level extractions match the distribution of the original 1,466 real responses. No human validation of synthetic fidelity, no n-gram or embedding-overlap statistics, and no explicit statement that the test split remained strictly real-only are provided; without these, the perfect scores may reflect generation-style leakage rather than improved capture of learning-progression concepts.
  2. [Results] The claim that augmentation 'substantially increased alignment with human scoring' for categories 1–11 is presented without reported statistical significance tests, confidence intervals, or per-category confusion matrices that would allow readers to judge whether the gains exceed what would be expected from random variation on the small real test set.
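One way the n-gram overlap statistic requested in major comment 1 could be computed is sketched below. This is an illustrative assumption, not a procedure described in the paper: whitespace tokenization, bigrams, and the helper names `ngrams`, `jaccard_overlap`, and `novel_fraction` are all hypothetical choices.

```python
from typing import List, Set, Tuple

NGram = Tuple[str, ...]


def ngrams(text: str, n: int = 2) -> Set[NGram]:
    """Set of word n-grams in a text, using lowercase whitespace tokens."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def corpus_ngrams(texts: List[str], n: int = 2) -> Set[NGram]:
    """Union of the n-gram sets of every text in a corpus."""
    return set().union(*(ngrams(t, n) for t in texts))


def jaccard_overlap(real: List[str], synthetic: List[str], n: int = 2) -> float:
    """Jaccard similarity between the real and synthetic n-gram vocabularies."""
    real_set, synth_set = corpus_ngrams(real, n), corpus_ngrams(synthetic, n)
    union = real_set | synth_set
    return len(real_set & synth_set) / len(union) if union else 0.0


def novel_fraction(real: List[str], synthetic: List[str], n: int = 2) -> float:
    """Share of synthetic n-grams that never occur in the real responses."""
    real_set, synth_set = corpus_ngrams(real, n), corpus_ngrams(synthetic, n)
    return len(synth_set - real_set) / len(synth_set) if synth_set else 0.0


if __name__ == "__main__":
    # Hypothetical one-sentence corpora, not taken from the dataset.
    real = ["the cars move apart"]
    synthetic = ["the cars move together"]
    print(jaccard_overlap(real, synthetic), novel_fraction(real, synthetic))
```

A very high overlap would suggest the synthetic data adds little beyond duplication, while a very high novel fraction would suggest it drifts from how students actually write; neither number alone establishes conceptual fidelity, which is why the referee also asks for human validation.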
minor comments (2)
  1. [Methods] The exact procedure for constructing the ALP grammar and the filtering thresholds used in EASE should be stated with sufficient detail (hyper-parameters, vocabulary size, generation temperature) to permit replication.
  2. [Experimental Setup] Clarify whether the 1,466 responses were split at the student or response level and whether any response-level leakage occurred when synthetic data were added to the training set.
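The split concerns raised above can be illustrated with a short sketch that partitions responses by student and adds synthetic rows to the training set only. The record fields `student_id` and `synthetic`, and both function names, are hypothetical; the paper does not state how its split was built.

```python
import random
from typing import Dict, List, Tuple

Record = Dict[str, object]


def split_by_student(records: List[Record], test_fraction: float = 0.2,
                     seed: int = 0) -> Tuple[List[Record], List[Record]]:
    """Hold out whole students so no student appears in both train and test."""
    students = sorted({r["student_id"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(students)
    n_test = max(1, round(len(students) * test_fraction))
    test_students = set(students[:n_test])
    train = [r for r in records if r["student_id"] not in test_students]
    test = [r for r in records if r["student_id"] in test_students]
    return train, test


def add_synthetic(train: List[Record], synthetic: List[Record]) -> List[Record]:
    """Append synthetic rows to the training set only, flagged for auditing.

    Assumption: the synthetic rows were generated from training responses
    alone; rows derived from held-out responses would leak test content.
    """
    real_rows = [{**r, "synthetic": False} for r in train]
    synth_rows = [{**s, "synthetic": True} for s in synthetic]
    return real_rows + synth_rows


if __name__ == "__main__":
    # Hypothetical data: 10 students with 2 responses each.
    records = [{"student_id": s, "text": f"response {i} by student {s}"}
               for s in range(10) for i in range(2)]
    train, test = split_by_student(records, test_fraction=0.2, seed=1)
    augmented = add_synthetic(train, [{"student_id": None, "text": "synthetic text"}])
    print(len(augmented), len(test))
```

Splitting before augmentation, rather than after, is the design choice that keeps the test set strictly real.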

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will make the indicated revisions to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Results (ALP evaluation)] The headline result that ALP produces perfect P/R/F1 on categories 5, 6, 7 and 9 rests on the unverified assumption that GPT-4 generations and phrase-level extractions match the distribution of the original 1,466 real responses. No human validation of synthetic fidelity, no n-gram or embedding-overlap statistics, and no explicit statement that the test split remained strictly real-only are provided; without these, the perfect scores may reflect generation-style leakage rather than improved capture of learning-progression concepts.

    Authors: We agree that the manuscript should explicitly address data splits and provide quantitative evidence of synthetic data fidelity. In the revision we will state that the test set consists exclusively of real student responses (synthetic data is used only for training). We will also add n-gram overlap statistics and embedding cosine similarity scores comparing synthetic and real responses to demonstrate distributional alignment. These additions will clarify that the reported perfect scores on categories 5, 6, 7, and 9 reflect improved learning of the target concepts rather than leakage. revision: yes

  2. Referee: [Results] The claim that augmentation 'substantially increased alignment with human scoring' for categories 1–11 is presented without reported statistical significance tests, confidence intervals, or per-category confusion matrices that would allow readers to judge whether the gains exceed what would be expected from random variation on the small real test set.

    Authors: We acknowledge that the original submission omitted statistical significance testing, confidence intervals, and confusion matrices. In the revised manuscript we will add McNemar’s tests (or appropriate paired tests) for the performance differences, report 95% confidence intervals for all precision/recall/F1 values, and include per-category confusion matrices for the baseline SciBERT and each augmentation condition. These changes will allow readers to assess whether the observed gains exceed random variation on the held-out real test set. revision: yes
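As a rough illustration of the paired test mentioned in the second response, the sketch below computes a two-sided exact McNemar p-value for two classifiers scored on the same test items in one binary category. It is a generic textbook formulation under assumed inputs, not the analysis the authors will run; the function name `mcnemar_exact` and the toy predictions are hypothetical.

```python
from math import comb
from typing import List


def mcnemar_exact(y_true: List[int], pred_a: List[int], pred_b: List[int]) -> float:
    """Two-sided exact McNemar p-value for two classifiers on the same items."""
    # Only discordant items (exactly one model correct) carry information.
    a_only = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a == t and b != t)
    b_only = sum(1 for t, a, b in zip(y_true, pred_a, pred_b) if a != t and b == t)
    n = a_only + b_only
    if n == 0:
        return 1.0  # the models never disagree in correctness
    k = min(a_only, b_only)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical category: 8 test items, the baseline misses all 5 positives.
    y_true = [1, 1, 1, 1, 1, 0, 0, 0]
    baseline = [0, 0, 0, 0, 0, 0, 0, 0]
    augmented = [1, 1, 1, 1, 1, 0, 0, 0]
    print(mcnemar_exact(y_true, baseline, augmented))  # prints 0.0625
```

In this toy case the augmented model fixes all five baseline errors and introduces none, yet the p-value is 0.0625, above the conventional 0.05 threshold. That is the referee's concern in miniature: on rare categories, even a perfect score may rest on too few test items to rule out chance.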

Circularity Check

0 steps flagged

No circularity: empirical augmentation comparisons are self-contained

full rationale

The paper reports direct empirical results from fine-tuning SciBERT on real student responses augmented via GPT-4 generations, EASE word extraction, and ALP phrase extraction, then evaluating precision/recall/F1 on held-out real test data against SMOTE baselines. No equations, parameters, or claims reduce by construction to the inputs; performance metrics are computed independently from the augmentation process. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the central results. The derivation chain consists solely of standard ML training and evaluation steps whose outputs are not definitionally equivalent to the augmentation choices.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that augmentation preserves the distribution of student responses without introducing new biases.

axioms (2)
  • domain assumption: Fine-tuning pre-trained transformer models like SciBERT improves performance on text classification tasks.
    Used as baseline and improved upon with augmentation.
  • ad hoc to paper: Synthetic data generated by GPT-4 can accurately represent student responses in scientific explanations.
    Central to the GPT augmentation strategy.

pith-pipeline@v0.9.0 · 5612 in / 1418 out tokens · 48467 ms · 2026-05-15T07:32:53.149282+00:00 · methodology


Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

  1. [1]

    Fang, L., Lee, G.-G., & Zhai, X. (2023). Using GPT-4 to augment unbalanced data for automatic scoring. In Proceedings of the 6th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2023), pages 257–264. https://doi.org/10.18653/v1/2023.icnlsp-1.35

  2. [2]

    Rahman, A. M. M., Yin, W., & Wang, G. (2021). Data augmentation for text classification with EASE. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), pages 6665–6675. https://aclanthology.org/2021.emnlp-main.536

  3. [3]

    Martin, P. P., & Graulic, N. (2024). Navigating the data frontier in science assessment: Advancing data augmentation strategies for machine learning applications with generative artificial intelligence. Computers and Education: Artificial Intelligence. https://doi.org/10.1016/j.caeai.2024.100265

  4. [4]

    Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP 2019), pages 3615–3620. https://aclanthology.org/D19-1371

  5. [5]

    Cochran, K., Cohn, C., Hastings, P., Tomuro, N., & Hughes, S. (2023). Using BERT to Identify Causal Structure in Students’ Scientific Explanations. International Journal of Artificial Intelligence in Education. https://doi.org/10.1007/s40593-023-00373-y

  6. [6]

    Franovic, C. G. C., Chu, Y., Gjokaj, A., Tartar, M., Tang, J., Krajcik, J., Haudek, K., & Kaldaras, L. (in preparation). Learning progression-guided AI-human collaboration for rubric refinement and assessment of student knowledge-in-use

  7. [7]

    Chi Sun, Xipeng Qiu, Yige Xu, Xuanjing Huang. How to Fine-Tune BERT for Text Classification? arXiv preprint arXiv:1905.05583, 2019. https://arxiv.org/pdf/1905.05583

  8. [8]

    Wilson, M., Yao, S.-Y., & Osborne, J. (2024). Coordinating assessments with a learning progression. In H. Jin, D. Yan, & J. Krajcik (Eds.), Handbook of Research on Science Learning Progressions (1st ed., pp. 88–115). Routledge. https://doi.org/10.4324/9781003170785-7

  9. [9]

    P. Djagba, https://github.com/Prud11djagba/-Optimizing-AI-Scoring-of-Scientific-Explanations-Exploring-Augmentation-Strategies-

  10. [10]

    ALP: Data Augmentation using Lexicalized PCFGs for Few-Shot Text Classification, Hazel Kim, Daecheol Woo, Seong Joon Oh, Jeong-Won Cha, Yo-Sub Han, https://arxiv.org/abs/2112.11916, 2021

  11. [11]

    Markus Bayer, Marc-Andre Kaufhold and Christian Reuter, Peasec, "A Survey on Data Augmentation for Text Classification", ACM Computing Surveys, Vol. 55, No. 7, Article 146. 2022

  12. [12]

    Corcoran, T. B., Mosher, F. A., & Rogat, A. (2009). Learning progressions in science: An evidence-based approach to reform

  13. [13]

    Shorten, C.; and Khoshgoftaar, T. M. 2019. A survey on image data augmentation for deep learning. Journal of Big Data, 6(1): 1–48

  14. [14]

    A. Cader. The potential for the use of deep neural networks in e-learning student evaluation with new data augmentation method. In Artificial Intelligence in Education. AIED 2020. Lecture Notes in Computer Science, volume 12164, pages 37–42, 2018

  15. [15]

    T. H. Bell, C. Dartigues-Pallez, F. Jaillet, and C. Genolini. Data augmentation for enlarging student feature space and improving random forest success prediction. In Artificial Intelligence in Education. AIED 2021. Lecture Notes in Computer Science, vol 12749, pages 82–87, 2021

  16. [16]

    T. Zhou and H. Jiao. Data augmentation in machine learning for cheating detection in large-scale assessment: An illustration with the blending ensemble learning algorithm. Psychological Test and Assessment Modeling, 64(4):425–444, 2022

  17. [17]

    N. V Chawla, K. W Bowyer, L. O Hall, and W P. Kegelmeyer. Smote: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16: 321–357, 2002

  18. [18]

    OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt/ (Retrieved October 2, 2023), 2022

  19. [19]

    He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284

  20. [20]

    Japkowicz, N., & Stephen, S. (2002). The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5), 429–449

  21. [21]

    He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284. https://doi.org/10.1109/TKDE.2008.239

  22. [22]

    Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019 (pp. 4171–4186). https://aclanthology.org/N19-1423

  23. [23]

    Mayfield, E., Adamson, D., & Rosé, C. P. (2019). Hierarchical neural models for assessing scientific argumentation. In Proceedings of the 14th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2019) (pp. 138–148). https://aclanthology.org/W19-4416

  24. [24]

    Gweon, & Pellegrino, J. W. (2017). Automated scoring of constructed responses in science: Using natural language processing to evaluate scientific explanations. Educational Measurement: Issues and Practice, 36(2), 17–28. https://doi.org/10.1111/emip.12146

  25. [25]

    Zhang, M., Dai, T., & Lester, J. (2020). Evaluating the explanatory depth of student responses using neural language models. In Proceedings of the 13th International Conference on Educational Data Mining (EDM 2020)

  26. [26]

    Rahman, A. M. M., Yin, W., & Wang, G. (2021). Data augmentation for text classification with EASE. In Proceedings of EMNLP 2021 (pp. 6665–6675). https://aclanthology.org/2021.emnlp-main.536

  27. [27]

    Kaldaras, L., Yoshida, N. R., & Haudek, K. C. (2022). Rubric development for AI-enabled scoring of three-dimensional constructed-response assessment aligned to NGSS learning progression. Frontiers in Education, 7. https://www.frontiersin.org/articles/10.3389/feduc.2022.983055

  28. [28]

    Zhai, X., Krajcik, J., & Pellegrino, J. W. (2021). On the Validity of Machine Learning-based Next Generation Science Assessments: A Validity Inferential Network. Journal of Science Education and Technology, 30(2), 298–312. https://doi.org/10.1007/s10956-020-09879-9

  29. [29]

    Kaldaras, L., Akaeze, H., & Krajcik, J. (2021). Developing and validating Next Generation Science Standards-aligned learning progression to track three-dimensional learning of electrical interactions in high school physical science. Journal of Research in Science Teaching, 58(4), 589–618. https://doi.org/10.1002/tea.21672

  30. [30]

    Kaldaras, L., Haudek, K., & Krajcik, J. (2024). Employing automatic analysis tools aligned to learning progressions to assess knowledge application and support learning in STEM. International Journal of STEM Education, 11(1), 57.