Leveraging BART to Assess CS1 C++ Programming Assignments using Rubric-based Criteria

Jesse Roberts; Kelsey Rainey

arxiv: 2606.03814 · v1 · pith:25EGDF65new · submitted 2026-06-02 · 💻 cs.AI

Pith reviewed 2026-06-28 09:38 UTC · model grok-4.3

classification 💻 cs.AI

keywords: automated grading, BART, rubric-based assessment, CS1 programming, multitask learning, soft labels, grade distribution alignment

The pith

Multitask BART fine-tuned with rubric context and boundary-based soft labels produces lower mean absolute error and better grade-distribution alignment than single-task or code-only baselines on CS1 C++ assignments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether rubric-aware multitask training of a BART model on paired student code, numeric scores, letter-grade buckets, and assignment rubrics can generate predictions that more closely match instructor grading patterns. It compares single-task versus multitask setups, hard versus soft labels, and rubric versus code-only inputs while adding a term that matches predicted grade spreads to observed ones. A sympathetic reader would care because introductory programming classes often have high enrollment where consistent, scalable grading that respects rubrics and avoids skewed distributions could reduce instructor workload without sacrificing fairness. The experiments also include T5 and pairwise-pretrained variants to isolate the contribution of each design choice.

Core claim

Multitask BART with boundary-based soft labels and rubric context achieves lower mean absolute error and stronger grade-distribution alignment than single-task, hard-label, or code-only baselines.

What carries the argument

BART encoder-decoder with LoRA adaptation, trained jointly to predict numeric grades and grade buckets using boundary-based soft labels plus a distribution-matching loss, conditioned on unified sequences that include assignment rubrics.
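
As a rough illustration of what a boundary-based soft label might look like, the sketch below shares probability mass between two adjacent grade buckets when a score falls close to the cut point between them. The cut points (60/70/80/90), the `margin` width, the linear blending rule, and the function name `boundary_soft_label` are all assumptions made for illustration; the page does not state the scheme the paper actually uses.

```python
from typing import List

# Hypothetical letter-grade cut points (F/D/C/B/A); not taken from the paper.
BOUNDARIES = [60.0, 70.0, 80.0, 90.0]


def boundary_soft_label(score: float, margin: float = 2.0) -> List[float]:
    """Return a probability vector over len(BOUNDARIES) + 1 grade buckets.

    Scores farther than `margin` from every boundary get a one-hot label.
    Within `margin` of a boundary, mass is shared linearly with the adjacent
    bucket, reaching 50/50 exactly on the boundary.
    Assumes `margin` is smaller than half the narrowest bucket, so at most
    one boundary can be within reach of any score.
    """
    label = [0.0] * (len(BOUNDARIES) + 1)
    # Bucket containing the score; a score on a boundary goes to the upper bucket.
    bucket = sum(score >= b for b in BOUNDARIES)
    label[bucket] = 1.0
    for i, b in enumerate(BOUNDARIES):
        dist = abs(score - b)
        if dist < margin:
            share = 0.5 * (1.0 - dist / margin)  # mass moved to the neighbour
            neighbour = i if score >= b else i + 1
            label[bucket] -= share
            label[neighbour] += share
    return label
```

Under these assumptions a score of 75 stays one-hot in the third bucket, while a score of 79 is labelled 0.75 for that bucket and 0.25 for the one above it, which is the kind of softening near bucket edges the summary describes.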

If this is right

Multitask training on both numeric scores and grade buckets improves both error metrics and distributional match relative to single-task models.
Boundary-based soft labels produce stronger alignment with instructor behavior than hard one-hot labels.
Conditioning on rubric text reduces numeric error compared with code-only inputs.
Fully fine-tuned T5 models can further improve distributional fidelity beyond the BART results.
Pairwise pretraining lowers numeric error but reduces sensitivity on minority grade classes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rubric-plus-distribution-matching approach could be tested on non-programming subjects where grading involves both quantitative scores and qualitative categories.
If the distribution-matching term proves robust, it might become a default component in automated assessment tools to prevent systematic grade inflation or deflation.
Future datasets could include multiple instructors per assignment to measure how well the model captures consensus versus individual variation.
The method might generalize to other languages if comparable rubric-annotated submission sets are collected.

Load-bearing premise

The multi-semester CS1 C++ dataset paired with rubrics is representative enough that models trained on it will produce instructor-like behavior on future assignments or with different instructors.

What would settle it

Apply the best-performing model to a fresh semester of C++ submissions from a different instructor and check whether mean absolute error stays as low and the predicted grade distribution remains as closely aligned as on the original test set.
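
The check described above needs two numbers: mean absolute error on the numeric scores and some measure of how far the predicted grade distribution sits from the observed one. The sketch below uses total variation distance for the second; that choice is an assumption, since this page does not name the alignment metric the paper reports, and the helper names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence


def mean_absolute_error(pred: Sequence[float], true: Sequence[float]) -> float:
    """Average absolute gap between predicted and instructor scores."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def bucket_distribution(buckets: Sequence[str], classes: Sequence[str]) -> Dict[str, float]:
    """Relative frequency of each grade bucket, including buckets never seen."""
    counts = Counter(buckets)
    return {c: counts[c] / len(buckets) for c in classes}


def total_variation(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Total variation distance: 0 means identical distributions, 1 means disjoint.

    Assumed stand-in for the paper's grade-distribution alignment measure.
    """
    return 0.5 * sum(abs(p[c] - q[c]) for c in p)
```

Running both functions on the original test set and on a fresh semester, then comparing the two pairs of numbers, would be one way to make the proposed check concrete.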

Figures

Figures reproduced from arXiv: 2606.03814 by Jesse Roberts, Kelsey Rainey.

**Figure 3.** Visual of boundary fuzzy membership buckets.

**Figure 4.** Distribution comparison of models for RQ2.

**Figure 5.** Distribution comparison of models for RQ3.

Original abstract

This paper investigates rubric-aware, multitask fine-tuning of transformer models for automated grading of introductory C++ programming assignments, with the goal of producing grade predictions that better reflect instructor grading behavior than general-purpose LLMs. Using multi-semester CS1 data, student submissions are paired with numeric scores, letter-grade buckets, and assignment rubrics, then preprocessed into unified sequences for transformer input. A BART encoder-decoder with LoRA adaptation is trained to jointly predict numeric grades and grade buckets, augmented with a distribution-matching term to align predicted and empirical grade distributions, an evaluation dimension often overlooked in prior work. Experiments compare single-task and multitask training, hard one-hot versus fuzzy and boundary-based soft labels, and rubric versus no-rubric conditions, with additional T5 and pairwise-pretrained variants. Results show that multitask BART with boundary-based soft labels and rubric context achieves lower mean absolute error and stronger grade-distribution alignment than single-task, hard-label, or code-only baselines. Fully fine-tuned T5 further improves distributional fidelity, while pairwise pretraining reduces numeric error at the cost of minority-class sensitivity. Collectively, the findings suggest that calibration-aware, rubric-guided training produces more instructor-like grading behavior than accuracy-optimized alternatives.
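
The abstract says submissions and rubrics are preprocessed into unified sequences, and that rubric and no-rubric conditions are compared. A minimal sketch of that idea follows. The `[RUBRIC]` and `[CODE]` tags, the whitespace normalisation, and the function name `build_input` are illustrative assumptions, not the paper's actual format.

```python
def build_input(rubric: str, code: str, use_rubric: bool = True) -> str:
    """Join rubric text and student code into one model input string.

    The section tags are hypothetical markers; a real pipeline would also
    truncate to the tokenizer's maximum length, which is omitted here.
    """
    parts = []
    if use_rubric:
        # Collapse rubric whitespace so line breaks do not waste input length.
        parts.append("[RUBRIC] " + " ".join(rubric.split()))
    # Keep code formatting intact apart from leading and trailing blank space.
    parts.append("[CODE] " + code.strip())
    return "\n".join(parts)
```

Calling `build_input(rubric, code, use_rubric=False)` yields the code-only input, so the same function can serve both sides of a rubric versus no-rubric comparison.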

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that multitask BART with rubric context and boundary soft labels beats baselines on in-distribution C++ grading metrics, but provides no test on new assignments or instructors.

The main point is that they fine-tune BART with LoRA in a multitask setup on numeric grades plus buckets, add rubric text, use boundary-based soft labels, and include a distribution-matching loss. On their multi-semester CS1 C++ data this combination gives lower mean absolute error and better grade distribution alignment than single-task, hard-label, or code-only runs.

They run a useful set of controlled comparisons across training modes, label types, and a few model variants including T5. The distribution term addresses a gap that most automated grading papers ignore, and the boundary soft labels are a sensible choice for scores near bucket edges. Those choices are concrete and worth noting for anyone doing similar rubric-guided work.

The results stay inside the same collection of semesters and assignments. There is no temporal split, no new problems, and no different instructor rubrics, so the claim that the model produces instructor-like behavior on future work rests on an untested assumption. Dataset size, exact baseline code, and statistical tests are also not described in the abstract, which makes it hard to judge how stable the gains are.

This is for researchers building automated graders for introductory programming courses. Someone already working on C++ CS1 feedback tools could try the multitask and soft-label ideas, but the paper does not claim or demonstrate broader transfer.

I would send it to peer review. The empirical comparisons are the core contribution and they are specific enough that referees can evaluate the setup and ask for the missing generalization checks.

Referee Report

2 major / 2 minor

Summary. The paper investigates rubric-aware multitask fine-tuning of BART (with LoRA) on multi-semester CS1 C++ student submissions paired with numeric scores, letter-grade buckets, and assignment rubrics. It claims that jointly predicting numeric grades and buckets, using boundary-based soft labels and a distribution-matching term, yields lower mean absolute error and stronger alignment with empirical grade distributions than single-task, hard-label, code-only, or non-rubric baselines; additional variants with T5 and pairwise pretraining are compared.

Significance. If the central empirical claims hold and generalize, the work would offer a practical advance in automated assessment for introductory programming by incorporating rubric context and explicit distribution calibration, two elements often missing from prior LLM grading studies. The multitask and soft-label design choices address real instructor grading nuances.

major comments (2)

[Abstract] Abstract and experimental description: the reported improvements in MAE and grade-distribution alignment are presented without dataset size, number of assignments/semesters, statistical tests, exact baseline code, or ablation controls, preventing verification of robustness.
[Experimental description] Experimental description: all results are obtained from held-out data within the same multi-semester CS1 C++ collection; no temporal splits, new-assignment types, or different-instructor rubrics are described. This directly undermines the claim that the model produces 'instructor-like grading behavior on future assignments,' as the representativeness assumption remains untested.
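
A temporal hold-out of the kind the second comment asks for could be sketched as below: train on earlier semesters and test only on the most recent one. The record layout, the `semester` key, and the `(year, term)` tuple encoding are assumptions for illustration; nothing here reflects how the paper's data is stored.

```python
from typing import Dict, List, Tuple


def temporal_split(rows: List[Dict], semester_key: str = "semester") -> Tuple[List[Dict], List[Dict]]:
    """Hold out the most recent semester as the test set.

    Assumes each row carries a sortable semester value, for example a
    (year, term_index) tuple, so that max() picks the latest semester.
    """
    latest = max(r[semester_key] for r in rows)
    train = [r for r in rows if r[semester_key] != latest]
    test = [r for r in rows if r[semester_key] == latest]
    return train, test
```

Unlike a random split, no submission from the held-out semester can leak into training, so the resulting test error speaks more directly to performance on future offerings of the course.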

minor comments (2)

[Methods] Notation for the boundary-based soft labels and the weight on the distribution-matching term should be defined explicitly in the methods section to aid reproducibility.
[Abstract] The abstract mentions 'fully fine-tuned T5' and 'pairwise pretraining' variants; a table summarizing all model configurations and hyper-parameters would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater experimental transparency and clearer scoping of generalization claims. We will revise the manuscript to incorporate the requested details and to temper language regarding future assignments. Below we respond point by point.

Referee: [Abstract] Abstract and experimental description: the reported improvements in MAE and grade-distribution alignment are presented without dataset size, number of assignments/semesters, statistical tests, exact baseline code, or ablation controls, preventing verification of robustness.

Authors: We agree that the abstract and experimental description omit several details required for independent verification. In the revised version we will add the total number of submissions, semesters, and distinct assignments; report statistical significance (paired t-tests or Wilcoxon tests) for MAE differences; specify the exact baseline implementations (including the code-only BART configuration and hyper-parameters); and present the full set of ablation results in a dedicated table. These additions will be placed in both the abstract (concise form) and the experimental section.

Revision: yes
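
To show what a paired comparison of per-submission errors involves, here is a standard-library sketch. It computes the paired t statistic and, in place of the t-test or Wilcoxon p-values the authors mention, an exact sign-flip permutation p-value; that substitution is made only to avoid external dependencies and is not something the authors propose. Function names are hypothetical.

```python
from itertools import product
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(err_a: Sequence[float], err_b: Sequence[float]) -> float:
    """t statistic for the mean of per-submission error differences (a minus b)."""
    d = [a - b for a, b in zip(err_a, err_b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))


def sign_flip_p_value(err_a: Sequence[float], err_b: Sequence[float]) -> float:
    """Exact two-sided paired permutation test on the summed difference.

    Enumerates all 2**n sign assignments, so it is only practical for small n;
    a larger test set would need random sampling of sign flips instead.
    """
    d = [a - b for a, b in zip(err_a, err_b)]
    observed = abs(sum(d))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(d)):
        total += 1
        if abs(sum(s * x for s, x in zip(signs, d))) >= observed - 1e-12:
            hits += 1
    return hits / total
```

Here `err_a` and `err_b` would hold the absolute errors of two models on the same held-out submissions, in the same order.
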
Referee: [Experimental description] Experimental description: all results are obtained from held-out data within the same multi-semester CS1 C++ collection; no temporal splits, new-assignment types, or different-instructor rubrics are described. This directly undermines the claim that the model produces 'instructor-like grading behavior on future assignments,' as the representativeness assumption remains untested.

Authors: We acknowledge the limitation. All reported results use random held-out splits drawn from the same multi-semester collection; no temporal, cross-assignment-type, or cross-instructor evaluations are present. We cannot retroactively create such splits without additional data. In revision we will (1) replace the phrase 'future assignments' with 'unseen submissions from the same course context' throughout the abstract, introduction, and conclusion, (2) add an explicit limitations paragraph stating that the current evaluation does not test temporal or cross-rubric generalization, and (3) outline planned future work on these dimensions. This keeps the claims commensurate with the evidence while preserving the contribution of rubric-aware multitask training within the studied distribution.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical model comparisons on held-out data

The paper's central results consist of standard machine-learning experiments: BART (and T5) models are fine-tuned on a multi-semester CS1 C++ dataset with rubric context and evaluated via MAE and grade-distribution metrics on held-out submissions. No derivation step reduces a reported prediction to a fitted parameter by construction, no self-citation supplies a uniqueness theorem or ansatz that the present work merely renames, and the evaluation protocol does not equate the output metric to an input label. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard transformer assumptions plus the domain assumption that rubric text and grade distributions from past semesters capture the relevant instructor behavior. One free parameter is the weight on the distribution-matching loss term.

free parameters (1)

weight on distribution-matching term
The abstract states the model is augmented with this term; its relative weight is a tunable hyperparameter that affects the reported alignment.
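
To make the role of this weight concrete, the sketch below combines a regression term, a soft-label classification term, and a distribution-matching term scaled by `dist_weight`. The specific forms (L1 regression, soft cross-entropy, KL divergence from the empirical bucket distribution to the batch-average prediction), the default weight, and all names are assumptions; the page does not give the paper's actual loss.

```python
from math import log
from typing import Sequence

EPS = 1e-12  # guards against log(0)


def soft_cross_entropy(pred: Sequence[float], target: Sequence[float]) -> float:
    """Cross-entropy between predicted bucket probabilities and a soft label."""
    return -sum(t * log(p + EPS) for p, t in zip(pred, target))


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q); terms with p_i == 0 contribute nothing."""
    return sum(pi * log((pi + EPS) / (qi + EPS)) for pi, qi in zip(p, q) if pi > 0)


def multitask_loss(
    pred_scores: Sequence[float],
    true_scores: Sequence[float],
    pred_probs: Sequence[Sequence[float]],
    soft_targets: Sequence[Sequence[float]],
    empirical_dist: Sequence[float],
    dist_weight: float = 0.1,  # hypothetical value for the free parameter
) -> float:
    """Regression + classification + dist_weight * distribution mismatch."""
    n = len(true_scores)
    regression = sum(abs(p - t) for p, t in zip(pred_scores, true_scores)) / n
    classification = sum(
        soft_cross_entropy(p, t) for p, t in zip(pred_probs, soft_targets)
    ) / n
    # Average predicted probability per bucket across the batch.
    batch_dist = [sum(p[j] for p in pred_probs) / n for j in range(len(empirical_dist))]
    distribution = kl_divergence(empirical_dist, batch_dist)
    return regression + classification + dist_weight * distribution
```

When the batch-average prediction already matches the empirical distribution, the third term vanishes and the weight has no effect; the further the two drift apart, the more the chosen weight shapes the total, which is why it counts as a free parameter behind the reported alignment.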

axioms (1)

Domain assumption: Multi-semester CS1 C++ submissions paired with rubrics form a representative training distribution for instructor grading behavior.
Invoked when the model is trained and evaluated on this data to claim instructor-like behavior.
