EssayCBM: Rubric-Aligned Concept Bottleneck Models for Transparent Essay Grading

Chengshuai Zhao; Fan Zhang; Garima Agrawal; Huan Liu; Kumar Satvik Chaudhary; Yuli Deng

arxiv: 2512.20817 · v2 · submitted 2025-12-23 · 💻 cs.CL


Pith reviewed 2026-05-16 20:01 UTC · model grok-4.3

classification 💻 cs.CL

keywords automated essay scoring · concept bottleneck models · interpretable machine learning · rubric-based grading · transparent AI · educational technology


The pith

EssayCBM decomposes automated essay scoring into eight interpretable writing concepts to achieve transparency while matching neural model performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EssayCBM to address the opacity of neural automated essay scoring systems. It breaks down grading into eight writing concepts aligned with rubrics, then maps those to a final score. This allows instructors to see and edit the intermediate predictions. The approach matches the accuracy of black-box models but adds the ability to audit and adjust decisions in real time through an interactive interface.

Core claim

EssayCBM is a rubric-aligned concept bottleneck model that first predicts eight writing concepts from an essay and then uses those concepts to compute the final grade. This explicit two-stage process makes the grading transparent and editable at the concept level, unlike direct end-to-end neural models. The system achieves performance on par with standard neural AES baselines while providing mechanisms for real-time inspection and modification of concept predictions.

What carries the argument

The concept bottleneck layer that maps essay representations to eight fixed writing concepts before predicting the score from those concepts.
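
The two-stage structure described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the concept names are hypothetical (this page does not list the paper's eight concepts), the encoder output is replaced by a random vector, and both the sigmoid concept head and the linear grade head are assumptions chosen for simplicity.

```python
import math
import random

# Hypothetical concept names; the paper's actual eight concepts are not listed here.
CONCEPTS = ["thesis", "organization", "evidence", "analysis",
            "cohesion", "vocabulary", "grammar", "mechanics"]


def predict_concepts(h, W, b):
    """Stage 1: map an essay embedding h to one score in (0, 1) per concept."""
    scores = []
    for k in range(len(b)):
        z = sum(h_i * W[i][k] for i, h_i in enumerate(h)) + b[k]
        scores.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid (assumed)
    return scores


def predict_grade(c, w, b0):
    """Stage 2: the grade depends on the essay only through the concept scores."""
    return sum(c_k * w_k for c_k, w_k in zip(c, w)) + b0


rng = random.Random(0)
dim = 16  # assumed embedding size
W = [[rng.gauss(0, 0.1) for _ in CONCEPTS] for _ in range(dim)]
b = [0.0] * len(CONCEPTS)
w = [rng.gauss(0, 0.1) for _ in CONCEPTS]
h = [rng.gauss(0, 1) for _ in range(dim)]  # stand-in for a real encoder output

concepts = predict_concepts(h, W, b)
grade = predict_grade(concepts, w, 0.0)
```

Because `predict_grade` never sees `h`, any information the eight scores fail to capture is unavailable to the final grade. That is the source of both the transparency and the coverage risk discussed below.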

If this is right

Instructors gain the ability to inspect and modify concept-level predictions during grading.
Grading decisions become directly auditable and adjustable without retraining the model.
The framework maintains accuracy comparable to opaque neural baselines.
Real-time interactive systems can be built on top to demonstrate the editability.
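
The editability claim can be illustrated as follows, assuming a linear concept-to-grade head with fixed weights. The concept names, weights, and scores are made up for the example, and `apply_overrides` is a hypothetical helper rather than part of the paper's system.

```python
def grade_from_concepts(concepts, weights, bias=0.0):
    """Linear concept-to-grade mapping (an assumed form of the learned head)."""
    return bias + sum(weights[name] * score for name, score in concepts.items())


def apply_overrides(concepts, overrides):
    """Return a copy of the concept scores with instructor edits applied."""
    unknown = set(overrides) - set(concepts)
    if unknown:
        raise KeyError(f"unknown concepts: {sorted(unknown)}")
    edited = dict(concepts)  # leave the model's own predictions untouched
    edited.update(overrides)
    return edited


# Illustrative values only.
weights = {"organization": 2.0, "evidence": 3.0, "grammar": 1.0}
predicted = {"organization": 0.8, "evidence": 0.4, "grammar": 0.9}

before = grade_from_concepts(predicted, weights)
edited = apply_overrides(predicted, {"evidence": 0.7})
after = grade_from_concepts(edited, weights)
# With a linear head the grade moves by weight * change, here 3.0 * 0.3.
```

No weights change during the edit, which is what "without retraining" amounts to under this assumption: the instructor alters the input to the second stage, not the model.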

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This could be applied to other educational assessment tasks requiring transparency, such as project evaluations.
If the eight concepts are too coarse, subtle aspects of writing quality might be overlooked in the final score.
Extending the model to handle multiple rubrics or cross-domain essays would test its robustness further.

Load-bearing premise

That the eight writing concepts adequately represent the full range of rubric criteria and that the learned concept-to-grade mapping generalizes well across topics and student groups.

What would settle it

A significant drop in accuracy or poor alignment between predicted concepts and human ratings on a held-out set of essays from a different topic or population would indicate the approach does not hold.
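
One way to run such a check is to compute an agreement statistic between predicted and human scores on the held-out set. The sketch below implements quadratic weighted kappa, a standard agreement metric in AES; whether the paper reports it on this page's evidence is not established, and the score range passed in is an assumption of the caller.

```python
def quadratic_weighted_kappa(human, predicted, min_score, max_score):
    """Agreement between two integer ratings; 1.0 is perfect, 0.0 is chance level."""
    n_labels = max_score - min_score + 1
    n = len(human)
    if n == 0 or n != len(predicted):
        raise ValueError("ratings must be non-empty and of equal length")
    if n_labels < 2:
        raise ValueError("need at least two distinct score levels")

    # Observed confusion matrix: rows are human scores, columns are predictions.
    observed = [[0] * n_labels for _ in range(n_labels)]
    for a, b in zip(human, predicted):
        observed[a - min_score][b - min_score] += 1

    hist_h = [sum(row) for row in observed]
    hist_p = [sum(observed[i][j] for i in range(n_labels)) for j in range(n_labels)]

    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected from the marginals alone
    for i in range(n_labels):
        for j in range(n_labels):
            weight = (i - j) ** 2 / (n_labels - 1) ** 2
            num += weight * observed[i][j]
            den += weight * hist_h[i] * hist_p[j] / n
    # den is zero only when both raters used one and the same score throughout.
    return 1.0 if den == 0 else 1.0 - num / den
```

The same function can be applied per concept, comparing each predicted concept score (rounded to the rubric scale) with the human rating for that concept, which addresses the alignment half of the test.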

Figures

Figures reproduced from arXiv: 2512.20817 by Chengshuai Zhao, Fan Zhang, Garima Agrawal, Huan Liu, Kumar Satvik Chaudhary, Yuli Deng.

**Figure 2.** Overview of EssayCBM Architecture. Essays are encoded to predict rubric-aligned concepts, which are aggregated into [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]

**Figure 3.** An example of the EssayCBM Streamlit frontend, where users input essays, select encoder models, and receive [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]

Original abstract

Automated essay scoring (AES) has advanced significantly with neural language models, yet most systems remain opaque, offering little visibility into how grades are produced. In educational settings, instructors must be able to understand, trust, and occasionally override the automated grading decisions. We introduce EssayCBM, a rubric-aligned concept bottleneck framework that decomposes essay evaluation into eight interpretable writing concepts before computing the final score. Unlike direct LLM-based grading approaches, EssayCBM learns an explicit and auditable mapping from writing concepts to grades, allowing instructors to inspect and adjust rubric-level predictions during grading. EssayCBM matches neural AES baselines while making grading decisions transparent and directly editable at the rubric level. We further present an interactive system that demonstrates this capability by allowing instructors to inspect and modify concept predictions in real time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

EssayCBM routes essay grades through eight editable rubric concepts to add transparency and instructor control, matching neural baselines on the surface but depending on whether those concepts actually cover the full rubric without signal loss.


The main thing to know about this paper is that it takes the concept bottleneck model idea and applies it to automated essay scoring by first predicting scores on eight writing concepts that align with a rubric, then using those to compute the final grade. Instructors can view and change the concept predictions in an interactive interface before the grade is finalized. This setup aims to make the grading process more transparent and controllable than standard neural models.

What stands out as new is the specific integration with essay rubrics and the real-time editing feature. Concept bottleneck models have been used elsewhere, but the combination here with an editable prediction pipeline for education is the contribution.

The paper does well in emphasizing the practical needs of instructors who need to understand and sometimes override the system. By learning an explicit mapping from concepts to grades, it avoids some of the opacity issues. The claim of matching neural AES baselines suggests the approach preserves performance while adding the interpretability layer, which is a solid baseline comparison to make.

The soft spots are around the sufficiency of the eight concepts. The central claim depends on those concepts capturing the essential rubric dimensions without much loss. If elements like topic-specific analysis or stylistic nuance are not well represented, the bottleneck will discard useful information, and the editing feature might not fully compensate. The abstract does not detail how the concepts were selected or validated for coverage, so the experiments need to show that performance holds across different essay prompts and populations. Minor issues could include the choice of datasets or the exact architecture of the concept predictor, but those are standard and can be addressed in revision.

This work is for researchers in explainable AI applied to education or developers of automated assessment tools. A reader looking for ways to add human oversight to ML grading systems would get concrete value from the interface and the mapping approach. It deserves a serious referee because the problem is well-motivated and the proposed solution is testable with clear metrics.

Referee Report

2 major / 2 minor

Summary. The paper introduces EssayCBM, a rubric-aligned concept bottleneck model for automated essay scoring that decomposes evaluation into eight interpretable writing concepts before learning an explicit mapping to final grades. It claims performance parity with neural AES baselines, plus transparency and editability via an interactive system allowing real-time inspection and modification of concept predictions.

Significance. If the performance claims and concept coverage hold under rigorous testing, the work would advance explainable AI for education by delivering neural-level accuracy with auditable, instructor-editable intermediate representations. The interactive system is a practical strength that could support trust and override in real grading workflows.

major comments (2)

[Abstract and §4] (concept-to-grade mapping): the central claim of matching neural AES baselines without predictive loss is unsupported by any reported metrics, datasets, ablation results, or cross-topic generalization tests. The manuscript must include quantitative tables comparing EssayCBM accuracy, correlation, and error distributions against baselines on held-out data.
[§3] (eight writing concepts): the assumption that these fixed concepts encode all rubric dimensions with negligible information loss is load-bearing for the transparency claim. No coverage analysis, inter-rater validation against full rubrics, or ablation removing individual concepts is described; without this, the bottleneck may discard topic-specific or stylistic signal.

minor comments (2)

[§5] The interactive system description would benefit from explicit details on how concept predictions are surfaced to instructors and how overrides propagate to the final grade.
[§2] Notation for the concept bottleneck (e.g., how concept scores are normalized before the linear or learned mapping) should be defined consistently with standard CBM literature.
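
For reference, one consistent way to write the bottleneck in the style of the standard CBM literature is given below. The sigmoid normalization, the linear form of the grade head, and the symbols $h$, $w_k$, $\beta_k$, and $S$ are assumptions introduced here for illustration; the manuscript may define them differently.

```latex
\begin{aligned}
\hat{c} &= g(x) \in [0,1]^{8},
  \qquad \hat{c}_k = \sigma\!\left(w_k^{\top} h(x) + b_k\right), \\
\hat{y} &= f(\hat{c}) = \beta_0 + \sum_{k=1}^{8} \beta_k \,\hat{c}_k, \\
\hat{y}^{\text{edit}} &= f(\tilde{c}),
  \qquad \tilde{c}_k =
  \begin{cases}
    c_k^{\text{instr}} & \text{if } k \in S, \\
    \hat{c}_k          & \text{otherwise.}
  \end{cases}
\end{aligned}
```

Here $x$ is the essay, $h(x)$ its encoder representation, $g$ the concept predictor, $f$ the concept-to-grade mapping, and $S$ the set of concepts an instructor has overridden with values $c_k^{\text{instr}}$.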

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for stronger empirical validation. We address each major comment below and will revise the manuscript to incorporate additional quantitative results and analyses.


Referee: [Abstract and §4] (concept-to-grade mapping): the central claim of matching neural AES baselines without predictive loss is unsupported by any reported metrics, datasets, ablation results, or cross-topic generalization tests. The manuscript must include quantitative tables comparing EssayCBM accuracy, correlation, and error distributions against baselines on held-out data.

Authors: We agree that explicit quantitative support is required to substantiate the performance parity claim. The current manuscript references experiments on standard AES benchmarks (e.g., ASAP) showing comparable results to neural baselines, but we will expand §4 with new tables reporting accuracy, quadratic weighted kappa, Pearson and Spearman correlations, mean absolute error distributions, and cross-topic generalization on held-out prompts. These additions will be included in the revised version.

revision: yes

Referee: [§3] (eight writing concepts): the assumption that these fixed concepts encode all rubric dimensions with negligible information loss is load-bearing for the transparency claim. No coverage analysis, inter-rater validation against full rubrics, or ablation removing individual concepts is described; without this, the bottleneck may discard topic-specific or stylistic signal.

Authors: The eight concepts were derived from core dimensions in common essay rubrics to ensure broad coverage. We will add an ablation study in the revision quantifying the impact of removing each concept on final grade prediction performance. A correspondence table mapping concepts to rubric elements will also be included. Full inter-rater validation against complete rubrics would require new annotation efforts and is noted as a limitation for future work rather than completed in this revision.

revision: partial

Circularity Check

0 steps flagged

No circularity: concept extraction and mapping trained independently from data


The derivation chain decomposes essay grading into eight writing concepts whose predictions are learned from input essays, followed by a separate learned mapping from those concept scores to the final grade. This is standard supervised training of a bottleneck model; the final performance claim (matching neural AES baselines) is an empirical outcome of that training rather than a quantity forced by definition or by renaming fitted parameters as predictions. No self-citation chain, uniqueness theorem, or ansatz is invoked to justify the core architecture, and the eight concepts are presented as chosen design choices rather than derived quantities that presuppose the target grade. The model therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no explicit free parameters, axioms, or invented entities; the approach appears to rest on standard supervised-learning assumptions for concept prediction and on a linear or otherwise simple mapping from concepts to scores.


Reference graph

Works this paper leans on

9 extracted references · 9 canonical work pages · 1 internal anchor

[1] Rianne Conijn, Patricia Kahr, and Chris CP Snijders. 2023. The effects of explanations in automated essay scoring systems on student trust and motivation. Journal of Learning Analytics 10, 1 (2023), 37–53.

[2] Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. 2020. Concept bottleneck models. In International conference on machine learning. PMLR, 5338–5348.

[3] Vivekanandan Kumar and David Boulanger. 2020. Explainable automated essay scoring: Deep learning really has pedagogical value. In Frontiers in education, Vol. 5. Frontiers Media SA, 572367.

[4] Shengjie Li and Vincent Ng. 2024. Automated essay scoring: A reflection on the state of the art. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 17876–17888.

[5] Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heintz, and Dan Roth. 2023. Recent advances in natural language processing via large pre-trained language models: A survey. Comput. Surveys 56, 2 (2023), 1–40.

[6] Dadi Ramesh and Suresh Kumar Sanampudi. 2022. An automated essay scoring systems: a systematic literature review. Artificial Intelligence Review 55, 3 (2022), 2495–2527.

[7] Yaman Kumar Singla, Swapnil Parekh, Somesh Singh, Junyi Jessy Li, Rajiv Ratn Shah, and Changyou Chen. 2021. AES systems are both overstable and oversensitive: Explaining why and proposing defenses. arXiv preprint arXiv:2109.11728 (2021).

[8] Zhen Tan, Lu Cheng, Song Wang, Bo Yuan, Jundong Li, and Huan Liu. 2024. Interpreting pretrained language models via concept bottlenecks. In Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 56–74.

[9] Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043 (2023).
