arXiv: 2604.26607 · v1 · submitted 2026-04-29 · 💻 cs.AI · cs.CY · cs.SE

Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics

Pith reviewed 2026-05-07 13:19 UTC · model grok-4.3

classification 💻 cs.AI · cs.CY · cs.SE
keywords LLM benchmarking · competency-based education · secondary mathematics assessment · human-in-the-loop · rubric evaluation · mixture of experts · educational AI

The pith

Mixture-of-experts LLMs reach fair agreement with human graders on secondary math competencies while a 70B model shows none

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a human-in-the-loop benchmarking framework to evaluate how effectively different large language models can automate competency-based assessment in secondary mathematics. It builds a multi-dimensional rubric drawn from Nepal's Grade 10 Optional Mathematics curriculum that scores both specific topics and cross-cutting skills including comprehension, knowledge, operational fluency, and behavior and correlation. When tested against ground-truth ratings from two senior faculty members who agreed strongly with each other, Gemini-based sparse mixture-of-experts models produced fair weighted agreement while a much larger 70B model produced none. The authors interpret this as evidence that a model's ability to follow detailed instruction constraints matters more than raw parameter count for rubric-driven tasks. They conclude that current LLMs remain unsuitable for autonomous certification but can serve as useful assistive tools for preliminary evidence gathering.

Core claim

Gemini-based Mixture-of-Experts models achieved fair agreement (weighted kappa approximately 0.38) with senior faculty ground truth on a rubric covering four topics and four competencies, whereas the larger Orion 70B model achieved no agreement (weighted kappa of -0.0261), indicating that architectural compliance with assessment instructions outweighs model scale in constrained educational evaluation tasks.
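
The agreement statistic behind this claim can be sketched in a few lines. The block below is a minimal illustration, not the authors' code: it computes weighted Cohen's kappa for two raters on an ordinal scale and maps a value to a verbal band. The weighting scheme (quadratic by default here), the 0-based score levels, and the band cut-offs (the common Landis and Koch convention, with values at or below zero labelled "no agreement") are assumptions; the source only reports the kappa values and the labels "Fair Agreement" and "No Agreement".

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, n_levels, weights="quadratic"):
    """Weighted Cohen's kappa for two raters on ordinal levels 0..n_levels-1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rating lists must be non-empty and of equal length")
    if n_levels < 2:
        raise ValueError("need at least two ordinal levels")
    if weights not in ("linear", "quadratic"):
        raise ValueError("weights must be 'linear' or 'quadratic'")
    if any(not 0 <= r < n_levels for r in (*rater_a, *rater_b)):
        raise ValueError("ratings must lie in 0..n_levels-1")
    n = len(rater_a)
    power = 2 if weights == "quadratic" else 1

    def penalty(i, j):
        # Disagreement weight: 0 on the diagonal, 1 at maximum ordinal distance.
        return (abs(i - j) / (n_levels - 1)) ** power

    joint = Counter(zip(rater_a, rater_b))
    marg_a, marg_b = Counter(rater_a), Counter(rater_b)
    # Observed weighted disagreement versus the disagreement expected by chance.
    d_obs = sum(penalty(i, j) * c for (i, j), c in joint.items()) / n
    d_exp = sum(penalty(i, j) * marg_a[i] * marg_b[j]
                for i in range(n_levels) for j in range(n_levels)) / (n * n)
    if d_exp == 0:
        return float("nan")  # undefined: both raters used one identical level
    return 1.0 - d_obs / d_exp


def agreement_band(kappa):
    """Verbal band for a kappa value (assumed Landis-Koch style cut-offs)."""
    if kappa <= 0:
        return "no agreement"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


if __name__ == "__main__":
    # Kappa values as reported in the abstract; 0.38 is given there as approximate.
    for pair, k in [("faculty vs faculty", 0.8652),
                    ("Gemini MoE vs faculty", 0.38),
                    ("Orion 70B vs faculty", -0.0261)]:
        print(pair, agreement_band(k))
```

Under these assumed cut-offs the reported values fall in the same bands the paper names, which suggests, but does not confirm, that the authors used a similar convention.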

What carries the argument

The multi-dimensional rubric for four curriculum topics and four cross-cutting competencies (Comprehension, Knowledge, Operational Fluency, Behavior and Correlation) applied inside a human-in-the-loop benchmarking setup that compares heterogeneous LLMs against faculty consensus ground truth.
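
One way to picture the rubric's shape is as a 4 x 4 grid of topic-level labels that collapses into one score per competency; the Figure 2 caption describes this as a mapping from 16 topic-specific labels to four global competency scores. The sketch below is hypothetical: the topic names are placeholders (the source does not list them), the numeric score scale is invented, and the aggregation rule (mean across topics) is an assumption, since the source does not state how the mapping is computed.

```python
from statistics import mean

# Competency names as given in the paper.
COMPETENCIES = ("Comprehension", "Knowledge",
                "Operational Fluency", "Behavior and Correlation")
# Placeholder topic names: the source does not name the four topics.
TOPICS = ("topic_1", "topic_2", "topic_3", "topic_4")


def global_scores(labels):
    """Collapse 16 (topic, competency) labels into one score per competency.

    `labels` maps (topic, competency) pairs to numeric rubric scores.
    Assumed aggregation: arithmetic mean across the four topics.
    """
    expected = {(t, c) for t in TOPICS for c in COMPETENCIES}
    if set(labels) != expected:
        raise ValueError("labels must cover every topic/competency pair exactly")
    return {c: mean(labels[(t, c)] for t in TOPICS) for c in COMPETENCIES}


if __name__ == "__main__":
    # Illustrative scores only; not data from the paper.
    example = {(t, c): 2 for t in TOPICS for c in COMPETENCIES}
    example[("topic_3", "Knowledge")] = 0
    print(global_scores(example))
```

A median or a minimum across topics would be equally plausible aggregation rules; the choice changes how strongly a single weak topic pulls down the global score.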

If this is right

  • LLMs can accelerate preliminary evidence extraction from student work but cannot yet support autonomous competency certification.
  • Instruction-following architecture is a stronger predictor of success on rubric tasks than parameter count alone.
  • The high faculty inter-rater reliability (weighted kappa of 0.8652) confirms the rubric as a stable benchmark for future model testing.
  • Mixed ensembles of open-weight and proprietary models can be practically deployed for assistive grading support.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future LLM training for education applications could emphasize constraint adherence over continued scaling.
  • Comparable benchmarks in other subjects or countries could identify which architectures suit specific assessment styles.
  • Widespread adoption of such assistive systems might lower barriers to implementing competency-based education where teacher time is limited.

Load-bearing premise

That the two senior faculty assessments form a reliable ground truth, and that the performance gaps between models are caused by architectural differences rather than by unstated variations in prompting or configuration.

What would settle it

Repeating the full benchmark under identical prompts, temperature settings, and adaptation procedures for every model and finding that the 70B model then reaches fair or higher agreement with the faculty raters.
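
A replication of this kind depends on being able to show that every model ran under the same settings. The sketch below is one hypothetical way to record and check that: a frozen per-model configuration record plus a helper that lists any setting that differs between models. The model aliases come from the paper; every field name and value (`rubric_v1`, temperature 0.0, and so on) is an illustrative assumption, because the source does not report the actual configuration.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunConfig:
    """Settings for one model run; all field names are hypothetical."""
    model_alias: str
    prompt_template_id: str
    temperature: float
    top_p: float
    max_output_tokens: int
    output_format: str


def shared_settings(cfg):
    """Everything except the model identity, i.e. what must match across runs."""
    settings = asdict(cfg)
    settings.pop("model_alias")
    return settings


def differing_fields(configs):
    """Return the sorted names of settings that are not identical for all models."""
    if not configs:
        raise ValueError("need at least one configuration")
    reference, *others = [shared_settings(c) for c in configs]
    return sorted(k for k in reference
                  if any(o[k] != reference[k] for o in others))


# Aliases as used in the paper; the values are placeholders, not reported settings.
CONFIGS = [RunConfig(alias, "rubric_v1", 0.0, 1.0, 1024, "json")
           for alias in ("Eagle", "Orion", "Nova", "Lyra")]

if __name__ == "__main__":
    print(differing_fields(CONFIGS))
```

An empty result means the recorded settings match. It does not show that the providers applied them identically, so a check like this supports the uniformity claim without proving it.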

Figures

Figures reproduced from arXiv: 2604.26607 by Aayush Acharya, Jatin Bhusal, Nancy Mahatha, Raunak Regmi.

Figure 1: Overview of the benchmark construction and evaluation pipeline. Human experts design the rubric, create the assessment, and produce the ground-truth annotations before LLMs are evaluated on the same rubric-based competency prediction task.
Figure 2: Hierarchical competency evaluation rubric showing the mapping from 16 topic-specific labels to four global competency scores.
Figure 3: Comparison of LLM performance. Plot (a) shows agreement with human experts, highlighting that while Nova leads, there is a significant gap relative to human-to-human consistency. Plot (b) shows the varying levels of agreement between model pairs.
Original abstract

As Competency-Based Education (CBE) is gaining traction around the world, the shift from marks-based assessment to qualitative competency mapping is a manual challenge for educators. This paper tackles the bottleneck issue by suggesting a "Human-in-the-Loop" benchmarking framework to assess the effectiveness of multiple LLMs in automating secondary-level mathematics assessment. Based on the Grade 10 Optional Mathematics curriculum in Nepal, we created a multi-dimensional rubric for four topics and four cross-cutting competencies: Comprehension, Knowledge, Operational Fluency, and Behavior and Correlation. The multi-provider ensemble, consisted of open-weight models -- Eagle (Llama 3.1-8B) and Orion (Llama 3.3-70B) -- and proprietary frontier models Nova (Gemini 2.5 Flash) and Lyra (Gemini 3 Pro), was benchmarked against a ground truth defined by two senior mathematics faculty members (kappa_w = 0.8652). The findings show a marked "Architecture-compatibility gap". Although the Gemini-based Mixture-of-Experts (Sparse MoE) models achieved "Fair Agreement" (kappa_w ~ 0.38), the larger Orion (70B) model exhibited "No Agreement" (kappa_w = -0.0261), suggesting that architectural compliance with instruction constraints outweighs the scale of raw parameters in rubric-constrained tasks. We conclude that while LLMs are not yet suitable for autonomous certification, they provide high-value assistive support for preliminary evidence extraction within a "Human-in-the-Loop" framework.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This paper presents a human-in-the-loop benchmarking framework for assessing the performance of heterogeneous LLMs in automating competency-based evaluation of secondary-level mathematics. Using a rubric derived from Nepal's Grade 10 Optional Mathematics curriculum that evaluates four topics across competencies of Comprehension, Knowledge, Operational Fluency, and Behavior and Correlation, the authors compare open-weight models (Eagle 8B, Orion 70B) and Gemini-based models (Nova, Lyra) against ratings from two senior faculty members (inter-rater kappa_w = 0.8652). Key findings include fair agreement for the Gemini MoE models (kappa_w ~ 0.38) contrasted with no agreement for the larger Orion model (kappa_w = -0.0261), supporting the claim of an 'Architecture-compatibility gap' where model architecture influences rubric adherence more than parameter count. The authors conclude that LLMs offer value in assistive roles within human-in-the-loop systems but are not ready for autonomous certification.

Significance. Should the results prove robust to controls for prompting and hyperparameter standardization, the work offers a timely contribution to AI applications in education by demonstrating that larger models do not necessarily excel in structured, constraint-heavy tasks like rubric-based assessment. The explicit use of weighted kappa against human ground truth provides a transparent and falsifiable basis for comparison, and the human-in-the-loop emphasis aligns with practical deployment realities in competency-based education. This could inform both model developers and educators on the selection of LLMs for assessment support, particularly highlighting potential advantages of certain architectures like sparse MoE in following complex instructions.

major comments (2)
  1. [Methods] The experimental protocol does not specify whether a single shared prompt template, output format, temperature, and sampling parameters were applied uniformly to all models (Eagle, Orion, Nova, Lyra). Absent a verbatim prompt example or per-model configuration table, the causal link between the observed kappa_w gap (Gemini MoE ~0.38 vs. Orion -0.0261) and architectural differences cannot be isolated from possible implementation variations, which directly undermines the central 'Architecture-compatibility gap' conclusion.
  2. [Evaluation and Results] While the ground-truth kappa_w of 0.8652 between the two faculty raters is reported, the manuscript provides no details on the total number of student responses assessed, the sampling method, or any measures of rater consistency beyond the single kappa value. This makes it challenging to evaluate the statistical power and reliability of the model comparisons.
minor comments (2)
  1. [Abstract] The phrasing 'The multi-provider ensemble, consisted of' is grammatically incorrect and should be revised to 'The multi-provider ensemble consisted of' or 'consisting of' for improved readability.
  2. [Throughout] The term 'kappa_w' should be explicitly defined as weighted Cohen's kappa upon first mention to aid readers unfamiliar with the metric.
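
Major comment 2 turns on how much uncertainty sits behind each reported kappa value, which depends on the unreported sample size. One common way to express that uncertainty is a percentile bootstrap interval over the paired ratings. The sketch below is illustrative only: the ratings are made up, the quadratic weighting and 0-based four-level scale are assumptions, and nothing here reproduces the paper's analysis.

```python
import random


def weighted_kappa(a, b, n_levels):
    """Quadratic-weighted Cohen's kappa; None when chance disagreement is zero."""
    n = len(a)

    def penalty(i, j):
        return ((i - j) / (n_levels - 1)) ** 2

    d_obs = sum(penalty(i, j) for i, j in zip(a, b)) / n
    # Chance disagreement: average penalty over all cross pairings of the two
    # raters' scores (O(n^2), acceptable for small samples).
    d_exp = sum(penalty(i, j) for i in a for j in b) / (n * n)
    return None if d_exp == 0 else 1.0 - d_obs / d_exp


def bootstrap_ci(a, b, n_levels, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for weighted kappa over paired ratings."""
    if len(a) != len(b) or not a:
        raise ValueError("rating lists must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_resamples):
        # Resample student responses with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        k = weighted_kappa([a[i] for i in idx], [b[i] for i in idx], n_levels)
        if k is not None:  # skip degenerate resamples where kappa is undefined
            stats.append(k)
    if not stats:
        raise ValueError("kappa was undefined in every resample")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


if __name__ == "__main__":
    # Made-up ratings on a 0..3 scale, for illustration only.
    human = [0, 1, 2, 3, 2, 1, 0, 3, 2, 2, 1, 3]
    model = [0, 2, 2, 3, 1, 1, 1, 3, 3, 2, 0, 2]
    print(weighted_kappa(human, model, 4), bootstrap_ci(human, model, 4))
```

With a small number of responses the interval is typically wide, which is why the sample size matters for judging whether a gap such as 0.38 versus -0.0261 is distinguishable from noise.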

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful comments, which will help improve the clarity and robustness of our paper. We address the major comments below and will make the necessary revisions to the manuscript.

Point-by-point responses
  1. Referee: [Methods] The experimental protocol does not specify whether a single shared prompt template, output format, temperature, and sampling parameters were applied uniformly to all models (Eagle, Orion, Nova, Lyra). Absent a verbatim prompt example or per-model configuration table, the causal link between the observed kappa_w gap (Gemini MoE ~0.38 vs. Orion -0.0261) and architectural differences cannot be isolated from possible implementation variations, which directly undermines the central 'Architecture-compatibility gap' conclusion.

    Authors: We confirm that a single shared prompt template, consistent output format, and standardized parameters including temperature and sampling were applied uniformly across all models to ensure fair comparison and isolate architectural effects. The observed differences are thus attributable to model architecture rather than implementation variations. We will add a verbatim example of the prompt template and a table detailing the configuration for each model in the revised Methods section. revision: yes

  2. Referee: [Evaluation and Results] While the ground-truth kappa_w of 0.8652 between the two faculty raters is reported, the manuscript provides no details on the total number of student responses assessed, the sampling method, or any measures of rater consistency beyond the single kappa value. This makes it challenging to evaluate the statistical power and reliability of the model comparisons.

    Authors: We acknowledge the need for greater transparency in the evaluation details. We will revise the manuscript to include the total number of student responses assessed, a description of the sampling method, and additional measures of rater consistency beyond the single kappa value. This will allow readers to better evaluate the statistical power and reliability of the comparisons. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking against external human ground truth

full rationale

The paper performs an empirical comparison of LLM outputs to independent senior faculty ratings on a fixed rubric, computing weighted kappa agreement (human-human kappa_w = 0.8652). No equations, fitted parameters, self-citations, or ansatzes are invoked that reduce any claim to quantities defined by the paper's own inputs or prior author work. The architecture-vs-scale interpretation is an observational inference from the external benchmark results rather than a self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review is based solely on the abstract; no free parameters are described. The work relies on standard statistical measures and introduces one observational concept without independent falsifiable evidence.

axioms (1)
  • [standard math] Weighted Cohen's kappa is an appropriate and sufficient measure for quantifying agreement between LLM-generated rubric scores and human expert scores on ordinal competency scales.
    Used as the primary metric for all reported agreement levels including the ground-truth inter-rater value.
invented entities (1)
  • Architecture-compatibility gap [no independent evidence]
    purpose: To account for the observed performance difference between model families on instruction-constrained rubric tasks.
    Presented as an interpretive finding from the kappa results; no independent evidence or falsifiable prediction is supplied.
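
For reference, the statistic named in the axiom above has a standard textbook form. The block below states it for two raters on a scale with k ordered levels, where o_ij is the observed proportion of items scored i by the first rater and j by the second, and p_i. and p_.j are the two raters' marginal proportions. Which weighting the paper used (linear or quadratic) is not stated in the source, so both common choices are shown.

```latex
\[
\kappa_w \;=\; 1 - \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, o_{ij}}
                        {\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\, e_{ij}},
\qquad e_{ij} = p_{i\cdot}\, p_{\cdot j},
\]
\[
w_{ij}^{\text{linear}} = \frac{|i-j|}{k-1},
\qquad
w_{ij}^{\text{quadratic}} = \frac{(i-j)^2}{(k-1)^2}.
\]
```

The weights are disagreement penalties, so identical scores contribute nothing and the penalty grows with ordinal distance. A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean agreement worse than chance.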

