arXiv: 2506.04989 · v2 · submitted 2025-06-05 · cs.SE · cs.CY · cs.LG

BacPrep: Lessons from Deploying an LLM-Based Bacalaureat Assessment Platform

Pith reviewed 2026-05-19 11:16 UTC · model grok-4.3

classification: cs.SE · cs.CY · cs.LG
keywords: Bacalaureat assessment · LLM grading · automated feedback · educational platform · prompt decomposition · grading consistency · Gemini model · student solutions

The pith

BacPrep deployed an LLM grader on over 100 real Bacalaureat solutions and found repeated inconsistencies that motivate decomposed subject prompts and median scoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BacPrep, a free online platform that lets Romanian students practice official Bacalaureat exam questions with automated LLM feedback. Using the Gemini Flash model on questions from the last five years, the authors gathered more than 100 actual student answers in computer science and Romanian language. Early runs exposed clear problems including different scores on identical answers, mistakes when adding fractional marks, weaker results on long prompts, ignored subject-specific weights, and feedback that contradicted the assigned score. These concrete failures prompted a new design that splits prompts by subject, assigns dedicated graders to each subject, and picks the middle score from several independent model calls. The platform's next required step is direct comparison against human expert grades to check whether the changes improve reliability.

Core claim

BacPrep collected over 100 student solutions across Computer Science and Romanian Language Bacalaureat exams and used them to test LLM grading, uncovering inconsistency across multiple runs, arithmetic errors when aggregating fractional scores, performance degradation under large prompt contexts, failure to apply subject-specific rubric weightings, and internal inconsistencies between generated scores and qualitative feedback; these findings support a redesigned architecture featuring subject-level prompt decomposition, specialized per-subject graders, and a median-selection strategy across multiple runs.

What carries the argument

Subject-level prompt decomposition with specialized per-subject graders and median selection across multiple runs, which targets the observed grading inconsistencies and arithmetic errors.
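
The median-selection idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `median_score`, `make_fake_grader`, and the replayed scores are hypothetical names and values, and the fake grader stands in for a real model call.

```python
from statistics import median
from typing import Callable, Iterable, List


def median_score(grade_once: Callable[[str], float], answer: str, runs: int = 5) -> float:
    """Grade the same answer several times and return the median score."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    scores: List[float] = [grade_once(answer) for _ in range(runs)]
    return median(scores)


def make_fake_grader(outputs: Iterable[float]) -> Callable[[str], float]:
    """Hypothetical stand-in for an LLM call: replays a fixed list of scores."""
    replay = iter(outputs)
    return lambda answer: next(replay)


# Invented scores for one answer; the outlier 3.0 does not move the median.
fake_grader = make_fake_grader([7.5, 9.0, 7.5, 3.0, 8.0])
result = median_score(fake_grader, "student answer text", runs=5)
print(result)
```

With an odd number of runs the median is always a score that one of the runs actually produced, which keeps the reported grade traceable to a single model response.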

If this is right

  • Median selection across repeated model runs reduces score variation on the same student answer.
  • Subject-specific prompt decomposition enables accurate application of each exam's distinct rubric weights.
  • Breaking assessment into per-subject graders limits performance loss from overly long combined prompts.
  • The platform can supply free feedback to remote students once human validation confirms the changes work.
  • Expert validation against the collected solutions will decide whether the redesign meets reliability thresholds.
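
One way to picture subject-level decomposition with per-subject graders is the sketch below. Everything in it is an assumption made for illustration: the `SubjectSpec` type, the `build_prompt` and `grade_exam` helpers, the placeholder rubric text, and the maximum points per subject are not taken from the paper or from an official rubric, and `fake_model` replaces a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SubjectSpec:
    name: str          # e.g. "SUBIECTUL I"
    max_points: float  # placeholder weight, NOT taken from an official rubric
    rubric: str        # rubric text for this subject only


def build_prompt(spec: SubjectSpec, answer: str) -> str:
    """Build a short prompt holding one subject's rubric and answer."""
    return (
        f"Grade only {spec.name}. Maximum: {spec.max_points} points.\n"
        f"Rubric:\n{spec.rubric}\n"
        f"Student answer:\n{answer}\n"
        "Reply with a single number."
    )


def grade_exam(
    specs: List[SubjectSpec],
    answers: Dict[str, str],
    call_model: Callable[[str], float],
) -> Dict[str, float]:
    """Grade each subject separately and clamp each score to its own maximum."""
    scores: Dict[str, float] = {}
    for spec in specs:
        raw = call_model(build_prompt(spec, answers.get(spec.name, "")))
        scores[spec.name] = min(max(raw, 0.0), spec.max_points)
    return scores


SPECS = [
    SubjectSpec("SUBIECTUL I", 20.0, "rubric text I"),
    SubjectSpec("SUBIECTUL II", 40.0, "rubric text II"),
    SubjectSpec("SUBIECTUL III", 30.0, "rubric text III"),
]
ANSWERS = {
    "SUBIECTUL I": "answer one",
    "SUBIECTUL II": "answer two",
    "SUBIECTUL III": "answer three",
}

seen_prompts: List[str] = []


def fake_model(prompt: str) -> float:
    """Hypothetical model stub: records the prompt and always returns 25."""
    seen_prompts.append(prompt)
    return 25.0


scores = grade_exam(SPECS, ANSWERS, fake_model)
print(scores)
```

Each prompt carries only one subject's rubric and answer, so prompt length stays bounded, and the per-subject maximum is enforced in code rather than left to the model.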

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decomposition and median approach could apply to automated grading of other national exams that use detailed rubrics.
  • The set of real student solutions offers a ready dataset for training smaller models specialized in exam grading.
  • Pairing LLM output with simple external rules for adding fractional scores would directly fix the observed arithmetic mistakes.
  • Future iterations could separate qualitative comment generation from numeric scoring to eliminate contradictions between the two.
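
The external-arithmetic suggestion can be made concrete with exact decimal addition. The sketch below assumes the model reports per-item scores as strings and that the total is computed outside the model; `total_score`, `check_reported_total`, and the item values are hypothetical.

```python
from decimal import Decimal
from typing import Iterable


def total_score(item_scores: Iterable[str]) -> Decimal:
    """Sum per-item scores exactly; scores are strings such as "0.25"."""
    return sum((Decimal(s) for s in item_scores), Decimal("0"))


def check_reported_total(item_scores: Iterable[str], reported_total: str) -> bool:
    """Return True when a model-reported total matches the exact sum."""
    return total_score(item_scores) == Decimal(reported_total)


# Invented per-item scores for one answer sheet.
items = ["1.5", "0.75", "2", "0.25", "1.1", "2.2"]
print(total_score(items))
print(check_reported_total(items, "7.9"))
```

Parsing from strings into `Decimal` avoids binary floating-point rounding (in floats, `1.1 + 2.2` is not exactly `3.3`), so a mismatch flagged by `check_reported_total` points to a genuine aggregation error rather than a representation artifact.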

Load-bearing premise

The proposed subject-level prompt decomposition, specialized graders, and median-selection strategy will reduce inconsistencies and errors once expert human validation is completed.

What would settle it

Human experts grading the same collected solutions and showing that the median-from-multiple-runs approach produces scores closer to human judgment and more stable across repeats than the original single-run method.
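
A comparison of this kind could be scored roughly as follows. The sketch is illustrative only: the function names are hypothetical, and the scores are toy numbers invented for the example, not data or results from the paper.

```python
from statistics import mean, median
from typing import Dict, List


def mean_abs_error(predicted: Dict[str, float], human: Dict[str, float]) -> float:
    """Average absolute gap between predicted and human scores."""
    return mean(abs(predicted[key] - human[key]) for key in human)


def mean_range(runs: Dict[str, List[float]]) -> float:
    """Average (max - min) of repeated scores per answer, a crude stability measure."""
    return mean(max(scores) - min(scores) for scores in runs.values())


# Toy data: three repeated model scores per answer, plus one human grade each.
runs = {"a1": [6.0, 8.0, 7.0], "a2": [9.0, 9.0, 5.0]}
human = {"a1": 7.0, "a2": 9.0}

single_run = {key: scores[0] for key, scores in runs.items()}
median_run = {key: median(scores) for key, scores in runs.items()}

print("single-run MAE:", mean_abs_error(single_run, human))
print("median MAE:", mean_abs_error(median_run, human))
print("mean range across runs:", mean_range(runs))
```

A lower error for `median_run` than for `single_run` on expert-graded solutions would support the redesign; judging stability of the median itself would additionally require repeating whole batches of runs.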

Figures

Figures reproduced from arXiv: 2506.04989 by Adrian-Marius Dumitran, Angela Liliana Dumitran, Radu Dita.

Figure 1
Figure 1: Exam category and version selection. While the platform currently only includes "Computer Science" and "Romanian" categories, adding more subjects and exam data is straightforward. We decided to focus on a small number of subjects and exam models as we want to collect a lot of data for these exam models so we can better tune our auto-correcting LLMs in the future. Taking the Exam: After starting the exam,…
Figure 2
Figure 2: Example of a multiple-choice Computer Science question in SUBIECTUL I. Submission and Evaluation: After completing the test, students are shown their score along with a breakdown of each response. The explanation includes reasoning for the correct answer, often accompanied by code evaluations, analysis, or step-by-step deduction (Figures 3, 4).
Figure 3
Figure 3: Automated evaluation of programming responses with output and explanation.
Figure 4
Figure 4: Feedback and scoring breakdown for mathematical/algorithmic questions. Session Resume and Progress Tracking: If a student exits the platform, their exam can be resumed later using the same email. The platform maintains state locally to support continuity in preparation (…
Figure 5
Figure 5: Resume previous session interface with email persistence. This clean, minimal interface supports focused practice, while behind the scenes, all responses are logged for later expert validation and LLM comparison. 4.3 Ongoing Data Collection: The priority remains building a robust dataset. The primary data collected are student solutions, paired with the corresponding question and official grading scheme. T…
Original abstract

Accessing quality preparation and feedback for the Romanian Bacalaureat exam is challenging, particularly for students in remote or underserved areas. This paper presents BacPrep, an experimental online platform exploring Large Language Model (LLM) potential for automated assessment, aiming to offer a free, accessible resource. Using official exam questions from the last 5 years, BacPrep employs the latest available Gemini Flash model (currently Gemini 2.5 Flash, via the 'gemini-flash-latest' endpoint) to prioritize user experience quality during the data collection phase, with model versioning to be locked for subsequent rigorous evaluation. The platform has collected over 100 student solutions across Computer Science and Romanian Language exams, enabling preliminary assessment of LLM grading quality. This revealed several significant challenges: grading inconsistency across multiple runs, arithmetic errors when aggregating fractional scores, performance degradation under large prompt contexts, failure to apply subject-specific rubric weightings, and internal inconsistencies between generated scores and qualitative feedback. These findings motivate a redesigned architecture featuring subject-level prompt decomposition, specialized per-subject graders, and a median-selection strategy across multiple runs. Expert validation against human-graded solutions remains the critical next step.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces BacPrep, an experimental platform using the Gemini Flash LLM to assess student solutions for the Romanian Bacalaureat exam in Computer Science and Romanian Language subjects. Drawing from over 100 collected student solutions based on official questions from the last 5 years, it details observed LLM grading challenges including inconsistency across multiple runs, arithmetic errors in fractional score aggregation, performance issues with large prompt contexts, failure to apply subject-specific rubric weightings, and inconsistencies between scores and qualitative feedback. These motivate a redesigned architecture with subject-level prompt decomposition, specialized per-subject graders, and median selection across multiple runs. Expert human validation is identified as the next critical step.

Significance. This deployment report provides concrete, real-world examples of LLM limitations in automated educational assessment for a high-stakes national exam. The collection of actual student solutions offers a valuable dataset for future studies. If the proposed redesign proves effective upon validation, it could significantly enhance accessible preparation tools for students in remote areas, contributing practical knowledge to the application of LLMs in education technology.

major comments (2)
  1. [Abstract, paragraph on preliminary assessment] No quantitative metrics are provided for the listed challenges (e.g., how often grading inconsistency occurs across runs or the magnitude of arithmetic errors), which weakens the ability to assess the problems' severity and the redesign's potential benefits.
  2. [Section describing the redesigned architecture] The subject-level prompt decomposition, specialized graders, and median-selection strategy are proposed to address the challenges, but the manuscript contains no implementation, no tests on the collected data, and no comparison results; since expert validation is future work, the assertion that this architecture will reduce the issues remains an untested assumption.
minor comments (1)
  1. [Model usage description] The reference to the 'gemini-flash-latest' endpoint, with model versioning deferred to later evaluation, should clarify whether the underlying model was held constant during the data collection phase.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments correctly identify areas where the manuscript could better distinguish between observed deployment issues and unvalidated proposals. We respond to each major comment below and indicate planned revisions.

  1. Referee: [Abstract, paragraph on preliminary assessment] No quantitative metrics are provided for the listed challenges (e.g., how often grading inconsistency occurs across runs or the magnitude of arithmetic errors), which weakens the ability to assess the problems' severity and the redesign's potential benefits.

    Authors: We agree that quantitative metrics would allow readers to better gauge the severity of the reported challenges. The current text reflects qualitative observations made during the initial data-collection phase with the live platform. Because the primary goal at that stage was to gather real student solutions rather than to run controlled measurement experiments, we did not systematically log inconsistency rates or error magnitudes. We will revise the abstract and the preliminary-assessment paragraph to state explicitly that the listed issues are qualitative findings from deployment and that quantitative characterization is reserved for the planned expert-validation study. If any post-hoc logs permit even rough estimates, we will include them; otherwise the text will note their absence. revision: yes

  2. Referee: [Section describing the redesigned architecture] The subject-level prompt decomposition, specialized graders, and median-selection strategy are proposed to address the challenges, but the manuscript contains no implementation, no tests on the collected data, and no comparison results; since expert validation is future work, the assertion that this architecture will reduce the issues remains an untested assumption.

    Authors: The referee is correct that the redesigned architecture is described but neither implemented nor evaluated on the collected data. The manuscript presents the redesign as a direct response to the observed failure modes, not as a claim of proven improvement. We will revise the architecture section to remove any phrasing that could be read as asserting efficacy and will instead label the proposal clearly as “planned future work” whose effectiveness will be assessed only after expert human grading and controlled experiments are completed. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical deployment report with untested future redesign

The paper reports observations from deploying BacPrep, collecting >100 student solutions, and documenting specific LLM failures (inconsistency, arithmetic errors, rubric issues). It then motivates a proposed redesign (subject-level decomposition, median selection) as future work, explicitly noting that expert human validation remains to be done. No equations, fitted parameters, predictions, or self-citations appear in the text. The derivation chain is purely observational and does not reduce any claim to its own inputs by construction; the redesign is presented as an untested hypothesis rather than a derived result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work rests on standard assumptions about LLM behavior in grading tasks and the existence of official exam rubrics; no free parameters, invented entities, or non-standard axioms are introduced.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tag: unclear)

    Relation between the paper passage and the cited Recognition theorem.

    The platform has collected over 100 student solutions across Computer Science and Romanian Language exams, enabling preliminary assessment of LLM grading quality. This revealed several significant challenges: grading inconsistency across multiple runs, arithmetic errors when aggregating fractional scores, performance degradation under large prompt contexts, failure to apply subject-specific rubric weightings, and internal inconsistencies between generated scores and qualitative feedback.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)

    Relation between the paper passage and the cited Recognition theorem.

    These findings motivate a redesigned architecture featuring subject-level prompt decomposition, specialized per-subject graders, and a median-selection strategy across multiple runs.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
