Does Continued Pretraining on a Learner Corpus Improve Automated Essay Scoring on English Proficiency Tests? Evidence from EFCAMDAT

Duy Anh Nguyen

arxiv: 2605.25924 · v1 · pith:5L7LQVUFnew · submitted 2026-05-25 · 💻 cs.CL · cs.LG

Pith reviewed 2026-06-29 21:58 UTC · model grok-4.3

keywords automated essay scoring, domain-adaptive pretraining, learner corpus, CEFR alignment, transformer models, English proficiency tests, cross-dataset transfer

The pith

Targeted continued pretraining on CEFR-aligned learner writing improves in-domain automated essay scoring but does not reliably boost cross-dataset transfer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether domain-adaptive continued pretraining on the EFCAMDAT learner corpus helps transformer models score essays from English proficiency tests such as FCE and IELTS. Full-corpus pretraining yields inconsistent gains, which the authors trace to mismatches in proficiency level, genre, and purpose between the pretraining data and the test sets. A proficiency-based ablation then shows that restricting pretraining to CEFR-matched subsets produces steadier improvements on matching test data, especially B1-B2 material for FCE, yet these improvements rarely carry over to other test datasets. The central result is therefore that alignment between pretraining and downstream assessment settings matters for in-domain performance, while automatic transfer across different proficiency tests does not follow.

Core claim

Domain-adaptive continued pretraining on the full EFCAMDAT learner corpus produces mixed effects on transformer-based automated essay scoring for FCE and IELTS. When pretraining data are instead restricted to CEFR-aligned subsets that match the target test's proficiency band, downstream scoring improves more consistently on in-domain tasks, with the clearest gains for FCE using B1-B2 data. These targeted gains, however, do not reliably enhance few-shot cross-dataset transfer between FCE and IELTS.

What carries the argument

Proficiency-based ablation that selects CEFR-aligned subsets of EFCAMDAT for continued pretraining, compared against full-corpus DAPT on three transformer encoders.
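The subset selection behind this ablation can be pictured as a plain filter over CEFR-labelled learner texts. The sketch below is illustrative only: the record layout (`text`, `cefr`), the function name `select_cefr_subset`, and the sample sentences are assumptions made for this example, not the paper's actual preprocessing.

```python
def select_cefr_subset(records, bands):
    """Keep only learner texts whose CEFR label falls in the requested bands.

    `records` is a list of dicts with hypothetical keys "text" and "cefr".
    """
    wanted = {band.upper() for band in bands}
    return [r for r in records if r["cefr"].upper() in wanted]


# Toy corpus with invented sentences, used only to show the filter.
corpus = [
    {"text": "I am student.", "cefr": "A1"},
    {"text": "Last summer I visited my aunt in Hanoi.", "cefr": "B1"},
    {"text": "Although the policy was popular, it proved costly.", "cefr": "B2"},
    {"text": "The ramifications were scarcely foreseeable.", "cefr": "C1"},
]

# A B1-B2 subset, mirroring the band the review associates with FCE.
fce_subset = select_cefr_subset(corpus, {"B1", "B2"})
```

The filtered list would then feed continued pretraining in place of the full corpus; how the paper maps EFCAMDAT levels to CEFR bands is not stated in this review.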

If this is right

AES systems for a specific proficiency test can be improved by continued pretraining on learner data drawn from the same CEFR band.
Full-corpus learner pretraining is unlikely to be the default recipe for every downstream scoring task.
Cross-dataset transfer between different English exams will require additional techniques beyond simple continued pretraining on any single learner corpus.
Alignment checks on proficiency, genre, and purpose should precede any decision to apply DAPT to an AES model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If alignment is the decisive factor, then constructing small, purpose-matched learner subsets may be more efficient than scaling up general learner corpora.
The finding suggests that future AES work could treat continued pretraining as a data-selection problem rather than a scale problem.
One testable extension is to measure whether adding genre or purpose filters on top of CEFR alignment further stabilizes transfer between exams.

Load-bearing premise

The mixed results of full-corpus DAPT are caused by mismatches in proficiency, genre, and communicative purpose between EFCAMDAT and the FCE or IELTS test sets.

What would settle it

A controlled experiment that applies the same CEFR-subset pretraining to a new English proficiency test whose genre and communicative purpose are deliberately mismatched with EFCAMDAT and measures whether in-domain gains disappear.

Figures

Figures reproduced from arXiv: 2605.25924 by Duy Anh Nguyen.

**Figure 2.** Cross-dataset transfer setup. For each encoder architecture, both the domain-adapted model and its ...

**Figure 3.** Validation MLM loss of BERT, RoBERTa, and DistilBERT during continued pretraining on EFCAMDAT.

Original abstract

Recent automated essay scoring (AES) studies increasingly use pretrained transformer models, but these models are usually pretrained on general-domain English and may under-represent second-language learner writing. This study investigates whether domain-adaptive continued pretraining (DAPT) on the EFCAMDAT learner corpus improves transformer-based AES for English proficiency tests. We apply DAPT to three transformer encoders and evaluate them on FCE and IELTS in both in-domain scoring and few-shot cross-dataset transfer. Full-corpus DAPT produces mixed results across models, datasets, and metrics. Further analyses suggest that these mixed effects are partly explained by mismatches in proficiency, genre, and communicative purpose between EFCAMDAT and the downstream datasets. A proficiency-based ablation shows that targeted DAPT using CEFR-aligned subsets improves downstream scoring more reliably than full-corpus DAPT, especially for FCE with B1--B2 data. However, these gains do not consistently improve cross-dataset transfer. Overall, the findings suggest that continued pretraining on a learner-writing corpus can benefit in-domain AES for English assessment when the pretraining data is sufficiently aligned with the downstream assessment settings. However, it does not automatically improve transferability across different English proficiency test datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The proficiency ablation is the clearest new piece, but without size-matched controls the claim that alignment drives the gains rests on shaky ground.

The paper checks whether DAPT on EFCAMDAT helps transformer AES models score FCE and IELTS essays. Full-corpus pretraining gives mixed results across models and metrics. Switching to CEFR-aligned subsets (B1-B2 for FCE) produces more consistent in-domain gains, though transfer across datasets stays unreliable.

The new element is the targeted ablation on this learner corpus with the proficiency split. The work is honest about the mixed full-corpus outcomes and follows up with further analysis on mismatches in level, genre, and purpose. That transparency is useful.

The main weakness is the ablation design. CEFR subsets are smaller than the full corpus, so any improvement could simply reflect less exposure to mismatched text rather than positive alignment. The paper does not report size-matched random subsamples or exact token counts per condition, which leaves the mismatch explanation under-supported. Basic statistical details and baseline deltas are also thin in the abstract, so the full text needs to supply those.

This is for researchers working on domain-adapted AES for language tests. It offers a narrow but concrete data point on when continued pretraining helps in-domain scoring.

Send it to review. The question is practical, the experiments are direct, and the mixed findings are worth referee scrutiny even if the controls need work.

Referee Report

2 major / 1 minor

Summary. The paper investigates whether domain-adaptive continued pretraining (DAPT) on the EFCAMDAT learner corpus improves transformer-based automated essay scoring (AES) for English proficiency tests (FCE and IELTS). Full-corpus DAPT yields mixed results across models, datasets, and metrics. Further analyses attribute these to mismatches in proficiency, genre, and communicative purpose. A proficiency-based ablation finds that targeted DAPT on CEFR-aligned subsets improves in-domain scoring more reliably than full-corpus DAPT (especially FCE with B1-B2 data), though gains do not consistently aid cross-dataset transfer. The conclusion is that aligned learner data can benefit in-domain AES but does not automatically improve transferability.

Significance. If the central empirical findings hold after controls, the work provides concrete evidence on the conditions for successful DAPT in educational NLP, showing that proficiency alignment can outperform untargeted pretraining on learner corpora for AES. It contributes to the literature on domain adaptation by highlighting that data volume alone does not explain outcomes and that targeted subsets matter for in-domain performance.

major comments (2)

[proficiency-based ablation / results section] The proficiency-based ablation (described in the abstract and results) claims that CEFR-aligned subsets improve downstream AES more reliably than full-corpus DAPT. However, CEFR subsets are necessarily smaller; without explicit size-matched controls (e.g., random subsamples of equal token count) or reporting of exact pretraining data volumes per condition, the observed gains on FCE could arise from reduced exposure to mismatched data rather than positive alignment effects. This directly undercuts the inference that mismatches explain the mixed full-corpus results.
[abstract and results] The abstract and further analyses report mixed results for full-corpus DAPT and attribute them to mismatches in proficiency/genre/purpose, but provide no details on statistical significance tests, exact metric deltas, or full baseline comparisons. This makes it difficult to evaluate the reliability and magnitude of the reported effects supporting the central claim.
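The size-matched control requested in the first major comment could be built roughly as follows. This is a minimal sketch under stated assumptions: whitespace splitting stands in for the encoder's real tokenizer, and the names `count_tokens` and `size_matched_sample` are hypothetical helpers, not anything taken from the paper.

```python
import random


def count_tokens(text):
    # Assumption: whitespace tokens approximate the pretraining token count.
    # A real control would count tokens with the encoder's own tokenizer.
    return len(text.split())


def size_matched_sample(texts, target_tokens, seed=0):
    """Draw a random subsample whose token total does not exceed target_tokens.

    Texts are visited in a seeded random order; a text is skipped when adding
    it would overshoot the budget, so the total ends at or just below it.
    """
    rng = random.Random(seed)
    pool = list(texts)
    rng.shuffle(pool)
    sample, total = [], 0
    for text in pool:
        n = count_tokens(text)
        if total + n > target_tokens:
            continue
        sample.append(text)
        total += n
        if total == target_tokens:
            break
    return sample, total
```

Setting `target_tokens` to the token count of the B1-B2 subset and repeating the draw over several seeds would give a random baseline of equal volume, which is what separates an alignment effect from a data-reduction effect.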

minor comments (1)

[abstract] The abstract could more explicitly name the three transformer encoders used and the specific metrics (e.g., QWK, Pearson) for which mixed results were observed.
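For readers unfamiliar with QWK (quadratic weighted kappa), the metric the referee names as an example, a minimal pure-Python sketch follows. It implements the standard definition, $\kappa = 1 - \sum_{i,j} w_{ij} O_{ij} / \sum_{i,j} w_{ij} E_{ij}$ with $w_{ij} = (i-j)^2/(K-1)^2$; whether the paper computes QWK in exactly this way is an assumption, and the function name is illustrative.

```python
def quadratic_weighted_kappa(y_true, y_pred, min_rating, max_rating):
    """Quadratic weighted kappa for integer ratings in [min_rating, max_rating].

    Requires at least two rating categories and ratings that are not all
    concentrated in a single category (otherwise the denominator is zero).
    """
    k = max_rating - min_rating + 1
    n = len(y_true)

    # Observed confusion matrix O[i][j]: human rating i, model rating j.
    observed = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        observed[t - min_rating][p - min_rating] += 1

    # Marginal histograms, used to form the chance-expected matrix E.
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    numerator = 0.0
    denominator = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            numerator += weight * observed[i][j]
            denominator += weight * hist_true[i] * hist_pred[j] / n
    return 1.0 - numerator / denominator
```

Perfect agreement gives a value of 1, chance-level agreement gives about 0, and systematic disagreement pushes the value negative.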

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate where revisions will be made to address the concerns.

Referee: [proficiency-based ablation / results section] The proficiency-based ablation (described in the abstract and results) claims that CEFR-aligned subsets improve downstream AES more reliably than full-corpus DAPT. However, CEFR subsets are necessarily smaller; without explicit size-matched controls (e.g., random subsamples of equal token count) or reporting of exact pretraining data volumes per condition, the observed gains on FCE could arise from reduced exposure to mismatched data rather than positive alignment effects. This directly undercuts the inference that mismatches explain the mixed full-corpus results.

Authors: We agree this is a valid concern and a limitation of the current analysis. The manuscript reports token counts per condition but does not include size-matched random subsample controls from the full corpus. It is possible the observed gains reflect reduced exposure to mismatched data. We will add these controls via additional experiments on random subsets matched in token count to the B1-B2 CEFR subset and report the comparative results to better isolate alignment effects from data volume reduction.

revision: yes

Referee: [abstract and results] The abstract and further analyses report mixed results for full-corpus DAPT and attribute them to mismatches in proficiency/genre/purpose, but provide no details on statistical significance tests, exact metric deltas, or full baseline comparisons. This makes it difficult to evaluate the reliability and magnitude of the reported effects supporting the central claim.

Authors: The full manuscript contains tables with all model conditions and the no-DAPT baselines. However, the abstract and narrative results do not include statistical significance tests or explicit delta values. We will revise both sections to report paired significance tests (e.g., t-tests) and exact metric deltas, and to highlight all baseline comparisons with effect magnitudes.

revision: yes
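The paired comparison promised in this response could look roughly like the sketch below, which computes the mean metric delta and the paired t statistic from per-run scores using only the standard library. The function name and the idea of pairing by random seed are assumptions for illustration; the p-value would still have to be read from a t distribution with the returned degrees of freedom.

```python
import math
import statistics


def paired_delta_and_t(baseline, adapted):
    """Return (mean delta, paired t statistic, degrees of freedom).

    `baseline` and `adapted` hold one score per matched run (for example,
    the same random seed with and without DAPT), in the same order.
    Needs at least two pairs and non-identical differences.
    """
    diffs = [a - b for a, b in zip(adapted, baseline)]
    n = len(diffs)
    mean_delta = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)
    return mean_delta, mean_delta / std_err, n - 1


# Placeholder scores invented for this example; they are NOT results
# from the paper.
placeholder_baseline = [0.70, 0.72, 0.71, 0.69, 0.73]
placeholder_adapted = [0.72, 0.73, 0.74, 0.70, 0.75]
mean_delta, t_stat, dof = paired_delta_and_t(placeholder_baseline, placeholder_adapted)
```

Reporting the mean delta alongside the test statistic keeps effect size and reliability visible together, which is the gap the referee points to.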

Circularity Check

0 steps flagged

No circularity: purely empirical ablation study

The paper reports experimental results from domain-adaptive pretraining (DAPT) on EFCAMDAT subsets followed by AES evaluation on FCE and IELTS. No equations, derivations, predictions from first principles, or parameter-fitting steps are present that could reduce to inputs by construction. All claims rest on direct empirical comparisons of model performance metrics across conditions; the proficiency-based ablation is a controlled experiment, not a self-referential definition or fitted-input prediction. No self-citation chains or uniqueness theorems are invoked as load-bearing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions in NLP about the benefits of continued pretraining when domains align and on the validity of AES evaluation metrics; no free parameters or invented entities are introduced to support the claim.

axioms (1)

domain assumption: Domain-adaptive continued pretraining improves model performance on downstream tasks when pretraining data sufficiently matches the target domain in proficiency and genre.
This assumption underpins the interpretation of both the mixed full-corpus results and the targeted ablation benefits.

