Has Automated Essay Scoring Reached Sufficient Accuracy? Deriving Achievable QWK Ceilings from Classical Test Theory
Pith reviewed 2026-05-10 01:46 UTC · model grok-4.3
The pith
Classical test theory yields two dataset-specific QWK ceilings that show how much headroom remains for automated essay scoring even after accounting for human rater noise.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Two dataset-specific QWK ceilings can be derived from the reliability coefficient of classical test theory, which can be estimated from standard two-rater benchmarks without extra annotation. The theoretical ceiling is the maximum QWK that an ideal AES model, one that perfectly predicts latent true scores, can achieve under the observed label noise. The human-like ceiling is the QWK attainable by an AES model whose scoring error matches that of a single human rater. Human-human QWK, the usual reference point, can underestimate these ceilings.
What carries the argument
Derivation of QWK ceilings from the reliability coefficient of classical test theory, where reliability is computed from inter-rater agreement to quantify the noise level separating observed scores from latent true scores.
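As a rough guide to why such ceilings can sit above human-human agreement, the textbook classical test theory relations can be written out as follows. This is a sketch in terms of Pearson correlation, not the paper's QWK-specific derivation, which is not reproduced here. The symbols $X_r$ (score from rater $r$), $T$ (latent true score), $E_r$ (rater error), and $\rho$ (reliability) are notation assumed for illustration, and the second line additionally assumes the two raters have equal error variance.

```latex
\begin{align}
X_r &= T + E_r, \quad \mathbb{E}[E_r] = 0, \quad \operatorname{Cov}(T, E_r) = 0, \quad \operatorname{Cov}(E_1, E_2) = 0 \\
\rho &= \frac{\operatorname{Var}(T)}{\operatorname{Var}(X_r)} = \operatorname{Corr}(X_1, X_2) \\
\operatorname{Corr}(T, X_r) &= \sqrt{\rho} \;\ge\; \rho \quad \text{for } 0 \le \rho \le 1
\end{align}
```

Under these assumptions, a predictor that recovers $T$ exactly correlates with a single noisy rating at $\sqrt{\rho}$, while two noisy ratings correlate with each other at only $\rho$.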
If this is right
- AES evaluation should compare model QWK not only to human-human agreement but also to these dataset-specific theoretical and human-like ceilings.
- A model reaching the human-like ceiling would be suitable for deployment in settings where it replaces one human rater.
- Current AES models can be re-assessed to quantify remaining headroom relative to both ceilings rather than to human-human QWK alone.
- The ceilings can be computed for any existing two-rater benchmark without new data collection.
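The quantities involved in the last point can be computed from two-rater data alone. The following is a minimal sketch, assuming integer scores on a known scale. The function names and the use of the inter-rater Pearson correlation as the reliability estimate are illustrative assumptions; the paper's own estimator is not specified here and may differ.

```python
from collections import Counter


def quadratic_weighted_kappa(a, b, min_score, max_score):
    """QWK between two equal-length lists of integer scores."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    k = max_score - min_score + 1
    if k < 2:
        raise ValueError("the score scale needs at least two categories")
    n = len(a)
    observed = Counter(zip(a, b))      # joint counts of (score_a, score_b)
    hist_a, hist_b = Counter(a), Counter(b)
    num = den = 0.0
    for i in range(min_score, max_score + 1):
        for j in range(min_score, max_score + 1):
            w = (i - j) ** 2 / (k - 1) ** 2            # quadratic disagreement weight
            num += w * observed.get((i, j), 0)          # observed disagreement
            den += w * hist_a.get(i, 0) * hist_b.get(j, 0) / n  # chance disagreement
    if den == 0:
        raise ValueError("QWK is undefined when expected disagreement is zero")
    return 1.0 - num / den


def pearson(x, y):
    """Pearson correlation, used here as an assumed reliability estimate."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


if __name__ == "__main__":
    # Toy scores for illustration only, not taken from any benchmark.
    rater1 = [1, 2, 3, 4, 2, 3, 4, 1]
    rater2 = [1, 2, 3, 3, 2, 4, 4, 2]
    print("human-human QWK:", quadratic_weighted_kappa(rater1, rater2, 1, 4))
    print("inter-rater correlation:", pearson(rater1, rater2))
```

A model's QWK against the benchmark labels can then be read against both numbers rather than against human-human QWK alone.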
Where Pith is reading between the lines
- Adopting these ceilings could shift AES research priorities toward models that explicitly model or correct for rater noise rather than simply maximizing agreement with noisy labels.
- Similar reliability-based ceiling derivations might apply to other subjective scoring tasks such as short-answer grading or peer review evaluation.
- If multi-rater data become available, the derived ceilings could be directly validated by measuring how close real high-quality models come to the theoretical limit.
- Deployment decisions in education might use the human-like ceiling as a minimum viable target when cost savings from automation are the goal.
Load-bearing premise
Classical test theory applies directly to essay scoring, with true latent scores existing and rater errors being random and uncorrelated.
What would settle it
Collect scores from many raters on an existing benchmark, compute the average as a proxy for latent true scores, then check whether an AES model trained to predict those averages achieves a QWK exceeding the theoretical ceiling derived from the two-rater reliability on the same data.
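The data side of this check can be sketched with simulated ratings. The example below is an illustration under stated assumptions, not the paper's experiment: true scores are Gaussian, rater errors are independent Gaussian noise, the benchmark label is a single rater's score, and the many-rater average stands in for a model that predicts the true score. All names and parameter values are hypothetical, and the paper's ceiling formula is not computed here.

```python
import random


def qwk(a, b, lo, hi):
    """Quadratic weighted kappa for integer scores on the scale lo..hi."""
    n, k = len(a), hi - lo + 1
    num = den = 0.0
    for i in range(lo, hi + 1):
        for j in range(lo, hi + 1):
            w = (i - j) ** 2 / (k - 1) ** 2
            num += w * sum(1 for x, y in zip(a, b) if x == i and y == j)
            den += w * a.count(i) * b.count(j) / n
    return 1.0 - num / den


def to_score(value, lo, hi):
    """Round to an integer score and clip to the scale.

    Python's round() sends exact .5 ties to the even integer.
    """
    return min(hi, max(lo, round(value)))


def rate(rng, true, noise_sd, lo, hi):
    """One simulated rating: true score plus independent Gaussian error."""
    return to_score(true + rng.gauss(0.0, noise_sd), lo, hi)


def simulate(n_essays=2000, n_extra_raters=20, noise_sd=0.8, lo=1, hi=6, seed=0):
    """Return (human-human QWK, QWK of the many-rater proxy against the label)."""
    rng = random.Random(seed)
    label, second, proxy = [], [], []
    for _ in range(n_essays):
        true = rng.gauss(3.5, 1.0)                      # latent true score
        label.append(rate(rng, true, noise_sd, lo, hi))   # benchmark label
        second.append(rate(rng, true, noise_sd, lo, hi))  # second benchmark rater
        # Held-out raters whose average approximates the true score.
        extra = [rate(rng, true, noise_sd, lo, hi) for _ in range(n_extra_raters)]
        proxy.append(to_score(sum(extra) / n_extra_raters, lo, hi))
    return qwk(label, second, lo, hi), qwk(proxy, label, lo, hi)


if __name__ == "__main__":
    human_human, proxy_vs_label = simulate()
    print("human-human QWK:", human_human)
    print("proxy-vs-label QWK:", proxy_vs_label)
```

In this setup the proxy agrees with the noisy label more closely than a second rater does, which is the gap the proposed check would compare against the derived theoretical ceiling.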
Original abstract
Automated essay scoring (AES) is commonly evaluated on public benchmarks using quadratic weighted kappa (QWK). However, because benchmark labels are assigned by human raters and inevitably contain scoring errors, it remains unclear both what QWK is theoretically attainable and what level is practically sufficient for deployment. We therefore derive two dataset-specific QWK ceilings based on the reliability concept in classical test theory, which can be estimated from standard two-rater benchmarks without additional annotation. The first is the theoretical ceiling: the maximum QWK that an ideal AES model that perfectly predicts latent true scores can achieve under label noise. The second is the human-like ceiling: the QWK attainable by an AES model with human-level scoring error, providing a practical target when AES is intended to replace a single human rater. We further show that human-human QWK, often used as a ceiling reference, can underestimate the true ceiling. Simulation experiments validate the proposed ceilings, and experiments on real benchmarks illustrate how they clarify the current performance and remaining headroom of modern AES models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper derives two dataset-specific QWK ceilings for automated essay scoring from classical test theory reliability (estimated via two-rater annotations): a theoretical ceiling for an ideal model perfectly recovering latent true scores under label noise, and a human-like ceiling for an AES system with human-level error variance. It argues these ceilings exceed typical human-human QWK references, validates the derivations via simulations, and applies them to real AES benchmarks to assess current headroom.
Significance. If the CTT model applies, the approach supplies a practical, annotation-free method for setting realistic performance targets and interpreting benchmark results in AES, addressing a long-standing ambiguity about what QWK values are attainable or sufficient. The simulation validation and use of standard two-rater data are strengths that make the ceilings falsifiable and reproducible when assumptions hold.
major comments (1)
- [§3] §3 (Derivation of ceilings): The central claim rests on CTT assumptions that observed scores = true score + independent random error and that two-rater reliability directly estimates the noise an AES would face. The manuscript does not address the possibility that essay raters share systematic biases (content, length, style), which would violate uncorrelated errors and potentially bias both ceilings; a sensitivity analysis or explicit justification is needed for the claim that ceilings can be reliably computed from standard two-rater data.
minor comments (1)
- [Abstract and §4] The abstract and §4 mention simulation validation and real-benchmark experiments but omit explicit details on data exclusion criteria and error propagation; adding these would improve reproducibility without altering the core argument.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address the single major comment below and will revise the paper to incorporate an explicit discussion of the relevant assumptions.
Referee: [§3] §3 (Derivation of ceilings): The central claim rests on CTT assumptions that observed scores = true score + independent random error and that two-rater reliability directly estimates the noise an AES would face. The manuscript does not address the possibility that essay raters share systematic biases (content, length, style), which would violate uncorrelated errors and potentially bias both ceilings; a sensitivity analysis or explicit justification is needed for the claim that ceilings can be reliably computed from standard two-rater data.
Authors: We agree that the independence of measurement errors is a foundational assumption of the classical test theory model used in §3. Shared systematic biases among raters (e.g., consistent leniency on length or style) would introduce correlated errors, which in turn would inflate the estimated reliability coefficient and produce higher QWK ceilings than would hold under fully independent errors. The original manuscript presents the derivations under the standard CTT framework without an explicit caveat on this point. In the revised manuscript we will expand the opening paragraphs of §3 to state the independence assumption clearly, note that AES benchmark protocols typically train and calibrate raters to reduce such biases, and explain that any residual common variance is absorbed into the true-score component of the reliability estimate. We will further observe that the resulting ceilings remain interpretable as dataset-specific upper bounds given the observed labeling process. A quantitative sensitivity analysis would require additional multi-rater data with explicit bias annotations, which are unavailable in the standard two-rater benchmarks we employ; we will therefore limit the addition to a qualitative discussion of the direction and magnitude of the effect.
Revision: yes
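The direction of the effect described in this response can be illustrated with a small simulation. This is a sketch under assumptions that are not from the paper: continuous scores, Gaussian true scores, and a shared rater bias modeled as an extra Gaussian term that is independent of the true score and common to both raters. Function names and parameter values are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(shared_sd, n_essays=5000, own_sd=0.8, seed=0):
    """Return (inter-rater correlation, squared correlation of rater 1 with truth).

    With independent errors the two values estimate the same reliability.
    A shared bias raises the first but not the second.
    """
    rng = random.Random(seed)
    true, r1, r2 = [], [], []
    for _ in range(n_essays):
        t = rng.gauss(0.0, 1.0)                 # latent true score
        shared = rng.gauss(0.0, shared_sd)      # bias common to both raters
        true.append(t)
        r1.append(t + shared + rng.gauss(0.0, own_sd))
        r2.append(t + shared + rng.gauss(0.0, own_sd))
    return pearson(r1, r2), pearson(r1, true) ** 2


if __name__ == "__main__":
    for sd in (0.0, 0.6):
        inter_rater, true_reliability = simulate(sd)
        print(f"shared_sd={sd}: inter-rater={inter_rater:.3f}, "
              f"true reliability={true_reliability:.3f}")
```

When the shared term is present, the inter-rater correlation counts the common bias as if it were true-score variance, so a reliability estimated from two raters overstates how well either rater tracks the latent score.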
Circularity Check
No circularity: ceilings derived from external CTT formulas applied to independent rater data
The paper applies standard classical test theory reliability formulas to two-rater correlation data to compute dataset-specific QWK ceilings. The theoretical ceiling is the expected QWK between a perfect latent-score predictor and noisy labels; the human-like ceiling models an AES with human error variance. These quantities are obtained directly from the CTT decomposition (observed score = true score + error) and the reliability coefficient estimated from separate annotations, without fitting any parameter to the AES model's own QWK values, without redefining the target metric in terms of itself, and without load-bearing self-citations. The derivation therefore remains independent of the quantities it seeks to bound.
Axiom & Free-Parameter Ledger
free parameters (1)
- reliability coefficient
axioms (1)
- domain assumption: classical test theory holds, i.e., the observed score equals the true score plus random measurement error