BacPrep: Lessons from Deploying an LLM-Based Bacalaureat Assessment Platform
Pith reviewed 2026-05-19 11:16 UTC · model grok-4.3
The pith
BacPrep deployed an LLM grader on over 100 real Bacalaureat solutions and found repeated inconsistencies that motivate decomposed subject prompts and median scoring.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BacPrep collected over 100 student solutions across Computer Science and Romanian Language Bacalaureat exams and used them to test LLM grading. The tests uncovered five problems: inconsistency across multiple runs, arithmetic errors when aggregating fractional scores, performance degradation under large prompt contexts, failure to apply subject-specific rubric weightings, and internal inconsistencies between generated scores and qualitative feedback. These findings motivate a redesigned architecture featuring subject-level prompt decomposition, specialized per-subject graders, and a median-selection strategy across multiple runs.
What carries the argument
Subject-level prompt decomposition with specialized per-subject graders and median selection across multiple runs, which targets the observed grading inconsistencies and arithmetic errors.
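The paper describes the decomposition only in prose, so the following is a minimal sketch of how one prompt per subject might be assembled and scored. Every name here (SubjectSpec, build_prompt, grade_by_subject, call_model) is hypothetical, the rubric text and point maxima are placeholders, and call_model stands in for whatever LLM call the platform actually makes.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SubjectSpec:
    """One exam subject with its own rubric (all values are hypothetical)."""
    name: str
    max_points: float
    rubric: str


def build_prompt(spec: SubjectSpec, answer: str) -> str:
    # Each prompt carries only one subject's rubric, keeping the context short.
    return (
        f"Grade the answer to {spec.name} out of {spec.max_points} points.\n"
        f"Rubric:\n{spec.rubric}\n"
        f"Answer:\n{answer}"
    )


def grade_by_subject(
    specs: List[SubjectSpec],
    answers: Dict[str, str],
    call_model: Callable[[str], float],
) -> Dict[str, float]:
    """Grade each subject with its own prompt instead of one combined prompt."""
    scores: Dict[str, float] = {}
    for spec in specs:
        raw = call_model(build_prompt(spec, answers[spec.name]))
        # Clamp so a misbehaving model cannot exceed the subject's maximum.
        scores[spec.name] = min(max(raw, 0.0), spec.max_points)
    return scores
```

Keeping the per-subject maximum in code rather than in the prompt is one way to enforce rubric weights that the model might otherwise ignore; whether it helps in practice is exactly what the planned validation would have to show.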
If this is right
- Median selection across repeated model runs reduces score variation on the same student answer.
- Subject-specific prompt decomposition enables accurate application of each exam's distinct rubric weights.
- Breaking assessment into per-subject graders limits performance loss from overly long combined prompts.
- The platform can supply free feedback to remote students once human validation confirms the changes work.
- Expert validation against the collected solutions will decide whether the redesign meets reliability thresholds.
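The median-selection idea in the first implication can be sketched in a few lines. This is an illustration under stated assumptions, not the platform's code: grade_once is a hypothetical callable wrapping a single LLM grading call, and the default of five runs is arbitrary.

```python
import statistics
from typing import Callable


def median_grade(
    grade_once: Callable[[str], float],
    answer: str,
    runs: int = 5,
) -> float:
    """Grade the same answer several times and return the median score."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    scores = [grade_once(answer) for _ in range(runs)]
    # median_low always returns a score that some run actually produced,
    # even when the number of runs is even.
    return statistics.median_low(scores)
```

Using statistics.median_low rather than statistics.median is a design choice made here so that the result is always a selected run rather than an average of two runs; with an odd number of runs the two functions agree.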
Where Pith is reading between the lines
- The same decomposition and median approach could apply to automated grading of other national exams that use detailed rubrics.
- The set of real student solutions offers a ready dataset for training smaller models specialized in exam grading.
- Pairing LLM output with simple external rules for adding fractional scores would directly fix the observed arithmetic mistakes.
- Future iterations could separate qualitative comment generation from numeric scoring to eliminate contradictions between the two.
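The suggestion to add fractional scores with external rules could look like the sketch below, which moves the arithmetic out of the model and into exact rational arithmetic. The function names are hypothetical, and the assumption that the model returns per-item scores as decimal strings is ours, not the paper's.

```python
from fractions import Fraction
from typing import List


def exact_total(item_scores: List[str]) -> Fraction:
    """Sum per-item scores given as decimal strings, without float rounding."""
    return sum((Fraction(s) for s in item_scores), Fraction(0))


def total_matches(reported_total: str, item_scores: List[str]) -> bool:
    """Check a model-reported total against the exact sum of its item scores."""
    return Fraction(reported_total) == exact_total(item_scores)
```

For example, exact_total(["0.5", "1.25", "0.75"]) equals Fraction(5, 2), and total_matches("2.75", ["0.5", "1.25", "0.75"]) is False, which would flag the kind of aggregation error the deployment observed.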
Load-bearing premise
The proposed subject-level prompt decomposition, specialized graders, and median-selection strategy will reduce inconsistencies and errors, a claim that remains untested until expert human validation is completed.
What would settle it
Human experts grading the same collected solutions and showing that the median-from-multiple-runs approach produces scores closer to human judgment and more stable across repeats than the original single-run method.
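One way to make this test concrete is to compare both strategies against human scores and to measure how much repeated runs disagree. The sketch below assumes each answer has one human score and a list of scores from repeated model runs; the function names and the choice of mean absolute error and max-minus-min spread are illustrative, not taken from the paper.

```python
import statistics
from typing import List, Sequence


def mean_abs_error(predicted: Sequence[float], human: Sequence[float]) -> float:
    """Average absolute gap between model scores and human scores."""
    if len(predicted) != len(human):
        raise ValueError("predicted and human must have the same length")
    return statistics.fmean([abs(p - h) for p, h in zip(predicted, human)])


def mean_run_spread(runs_per_answer: Sequence[Sequence[float]]) -> float:
    """Average max-minus-min spread across repeated runs of each answer."""
    return statistics.fmean([max(r) - min(r) for r in runs_per_answer])


def compare_strategies(
    runs_per_answer: Sequence[Sequence[float]],
    human: Sequence[float],
) -> dict:
    """Error of the first run alone versus the median of all runs."""
    single: List[float] = [r[0] for r in runs_per_answer]
    median: List[float] = [statistics.median(r) for r in runs_per_answer]
    return {
        "single_run_mae": mean_abs_error(single, human),
        "median_mae": mean_abs_error(median, human),
        "mean_run_spread": mean_run_spread(runs_per_answer),
    }
```

A lower median_mae than single_run_mae on the collected solutions would support the redesign; the reverse, or no difference, would count against it.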
Original abstract
Accessing quality preparation and feedback for the Romanian Bacalaureat exam is challenging, particularly for students in remote or underserved areas. This paper presents BacPrep, an experimental online platform exploring Large Language Model (LLM) potential for automated assessment, aiming to offer a free, accessible resource. Using official exam questions from the last 5 years, BacPrep employs the latest available Gemini Flash model (currently Gemini 2.5 Flash, via the gemini-flash-latest endpoint) to prioritize user experience quality during the data collection phase, with model versioning to be locked for subsequent rigorous evaluation. The platform has collected over 100 student solutions across Computer Science and Romanian Language exams, enabling preliminary assessment of LLM grading quality. This revealed several significant challenges: grading inconsistency across multiple runs, arithmetic errors when aggregating fractional scores, performance degradation under large prompt contexts, failure to apply subject-specific rubric weightings, and internal inconsistencies between generated scores and qualitative feedback. These findings motivate a redesigned architecture featuring subject-level prompt decomposition, specialized per-subject graders, and a median-selection strategy across multiple runs. Expert validation against human-graded solutions remains the critical next step.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BacPrep, an experimental platform using the Gemini Flash LLM to assess student solutions for the Romanian Bacalaureat exam in Computer Science and Romanian Language subjects. Drawing from over 100 collected student solutions based on official questions from the last 5 years, it details observed LLM grading challenges including inconsistency across multiple runs, arithmetic errors in fractional score aggregation, performance issues with large prompt contexts, failure to apply subject-specific rubric weightings, and inconsistencies between scores and qualitative feedback. These motivate a redesigned architecture with subject-level prompt decomposition, specialized per-subject graders, and median selection across multiple runs. Expert human validation is identified as the next critical step.
Significance. This deployment report provides concrete, real-world examples of LLM limitations in automated educational assessment for a high-stakes national exam. The collection of actual student solutions offers a valuable dataset for future studies. If the proposed redesign proves effective upon validation, it could significantly enhance accessible preparation tools for students in remote areas, contributing practical knowledge to the application of LLMs in education technology.
major comments (2)
- [Abstract, paragraph on preliminary assessment] No quantitative metrics are provided for the listed challenges (e.g., how often grading inconsistency occurs across runs or the magnitude of arithmetic errors), which weakens the ability to assess the problems' severity and the redesign's potential benefits.
- [Section describing the redesigned architecture] The subject-level prompt decomposition, specialized graders, and median-selection strategy are proposed to address the challenges, but the manuscript contains no implementation, no tests on the collected data, and no comparison results; since expert validation is future work, the assertion that this architecture will reduce the issues remains an untested assumption.
minor comments (1)
- [Model usage description] The description of the 'gemini-flash-latest' endpoint, and of locking the model version for later evaluation, should state whether the underlying model was held constant during the data collection phase.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments correctly identify areas where the manuscript could better distinguish between observed deployment issues and unvalidated proposals. We respond to each major comment below and indicate planned revisions.
- Referee: [Abstract, paragraph on preliminary assessment] No quantitative metrics are provided for the listed challenges (e.g., how often grading inconsistency occurs across runs or the magnitude of arithmetic errors), which weakens the ability to assess the problems' severity and the redesign's potential benefits.
Authors: We agree that quantitative metrics would allow readers to better gauge the severity of the reported challenges. The current text reflects qualitative observations made during the initial data-collection phase with the live platform. Because the primary goal at that stage was to gather real student solutions rather than to run controlled measurement experiments, we did not systematically log inconsistency rates or error magnitudes. We will revise the abstract and the preliminary-assessment paragraph to state explicitly that the listed issues are qualitative findings from deployment and that quantitative characterization is reserved for the planned expert-validation study. If any post-hoc logs permit even rough estimates, we will include them; otherwise the text will note their absence.
Revision: yes
- Referee: [Section describing the redesigned architecture] The subject-level prompt decomposition, specialized graders, and median-selection strategy are proposed to address the challenges, but the manuscript contains no implementation, no tests on the collected data, and no comparison results; since expert validation is future work, the assertion that this architecture will reduce the issues remains an untested assumption.
Authors: The referee is correct that the redesigned architecture is described but neither implemented nor evaluated on the collected data. The manuscript presents the redesign as a direct response to the observed failure modes, not as a claim of proven improvement. We will revise the architecture section to remove any phrasing that could be read as asserting efficacy and will instead label the proposal clearly as “planned future work” whose effectiveness will be assessed only after expert human grading and controlled experiments are completed.
Revision: yes
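The quantitative characterization promised in the first response could start from a simple grading log. The sketch below is hypothetical throughout: the GradingLog fields, the idea of storing the resolved model version with each run, and the tolerance parameter are assumptions, since the rebuttal states that no such logs were systematically kept.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class GradingLog:
    """One grading run; model_version is kept so runs can be filtered later."""
    answer_id: str
    model_version: str
    total: float


def inconsistency_rate(logs: Iterable[GradingLog], tolerance: float = 0.0) -> float:
    """Fraction of repeatedly graded answers whose totals differ by more than tolerance."""
    by_answer: Dict[str, List[float]] = defaultdict(list)
    for log in logs:
        by_answer[log.answer_id].append(log.total)
    # Only answers graded at least twice can reveal run-to-run inconsistency.
    repeated = [scores for scores in by_answer.values() if len(scores) > 1]
    if not repeated:
        raise ValueError("no answer was graded more than once")
    inconsistent = sum(1 for s in repeated if max(s) - min(s) > tolerance)
    return inconsistent / len(repeated)
```

Recording the model version alongside each score would also help with the referee's minor comment, because runs made under different versions of a "latest" endpoint could then be separated before computing the rate.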
Circularity Check
No circularity: empirical deployment report with untested future redesign
The paper reports observations from deploying BacPrep, collecting >100 student solutions, and documenting specific LLM failures (inconsistency, arithmetic errors, rubric issues). It then motivates a proposed redesign (subject-level decomposition, median selection) as future work, explicitly noting that expert human validation remains to be done. No equations, fitted parameters, predictions, or self-citations appear in the text. The derivation chain is purely observational and does not reduce any claim to its own inputs by construction; the redesign is presented as an untested hypothesis rather than a derived result.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: The platform has collected over 100 student solutions across Computer Science and Romanian Language exams, enabling preliminary assessment of LLM grading quality. This revealed several significant challenges: grading inconsistency across multiple runs, arithmetic errors when aggregating fractional scores, performance degradation under large prompt contexts, failure to apply subject-specific rubric weightings, and internal inconsistencies between generated scores and qualitative feedback.
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: These findings motivate a redesigned architecture featuring subject-level prompt decomposition, specialized per-subject graders, and a median-selection strategy across multiple runs.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.