
arXiv: 2604.09622 · v1 · submitted 2026-03-18 · cs.CY · cs.AI · cs.CL

Explainability and Certification of AI-Generated Educational Assessments

Pith reviewed 2026-05-15 09:05 UTC · model grok-4.3

classification: cs.CY · cs.AI · cs.CL
keywords: AI-generated assessments, explainability, certification, Bloom's taxonomy, SOLO taxonomy, educational AI, auditability, cognitive alignment

The pith

A framework adds self-rationalization and metadata to make AI-generated assessment items certifiable against Bloom's and SOLO taxonomies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a framework that uses self-rationalization, attribution-based analysis, and post-hoc verification to generate interpretable evidence of how AI-created test items align with established cognitive taxonomies. This evidence is packaged in a structured metadata schema that records provenance, alignment predictions, reviewer actions, and ethical indicators. A traffic-light workflow then separates items that can be auto-certified from those needing human review or rejection. The approach is tested on 500 computer science questions to show gains in transparency and auditability while lowering instructor workload.

Core claim

The framework produces interpretable cognitive-alignment evidence grounded in Bloom's and SOLO taxonomies and enables audit-ready documentation consistent with emerging governance requirements through the combination of self-rationalization, attribution-based analysis, post-hoc verification, and a certification metadata schema.

What carries the argument

The certification metadata schema, which records provenance, alignment predictions, reviewer actions, and ethical indicators, paired with a traffic-light workflow that separates auto-certifiable items from those requiring human review or rejection.
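
To make the schema concrete, here is a minimal Python sketch of what one certification record might look like. The paper names four groups of information (provenance, alignment predictions, reviewer actions, ethical indicators), but this summary gives no field names, so every class, field, and value below is a hypothetical illustration and not the authors' schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Provenance:
    # Hypothetical fields: which model and prompt produced the item, and when.
    model_name: str
    prompt_id: str
    generated_at: str  # ISO 8601 timestamp


@dataclass
class AlignmentPrediction:
    # Predicted taxonomy levels plus the model's own rationale text.
    bloom_level: str
    solo_level: str
    confidence: float
    rationale: str


@dataclass
class ReviewerAction:
    reviewer_id: str
    action: str  # assumed vocabulary, e.g. "approved", "edited", "rejected"
    timestamp: str


@dataclass
class CertificationRecord:
    item_id: str
    provenance: Provenance
    alignment: AlignmentPrediction
    reviewer_actions: List[ReviewerAction] = field(default_factory=list)
    ethical_indicators: Dict[str, bool] = field(default_factory=dict)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses and lists of dataclasses.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


# Placeholder values, invented for illustration only.
record = CertificationRecord(
    item_id="cs-0001",
    provenance=Provenance("example-llm", "prompt-v1", "2026-01-01T00:00:00Z"),
    alignment=AlignmentPrediction(
        bloom_level="Apply",
        solo_level="Multistructural",
        confidence=0.87,
        rationale="The item asks the student to use a known algorithm on new input.",
    ),
    ethical_indicators={"bias_flag": False, "sensitive_content": False},
)
print(record.to_json())
```

Serializing to JSON with sorted keys keeps records diffable, which suits an audit trail; an institution would likely pair such a record with a formal schema definition and versioning.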

If this is right

  • Institutions gain a documented trail that supports accreditation reviews of AI-assisted assessments.
  • Some generated items can be accepted without further review, lowering the time instructors spend checking AI output.
  • Ethical indicators become part of the permanent record for each item, aiding compliance with governance rules.
  • The same metadata structure can be reused across different subject areas once the initial taxonomy mappings are set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The traffic-light thresholds could be tuned per discipline if later studies show subject-specific patterns in alignment accuracy.
  • Integration with existing learning-management systems would let the metadata schema feed directly into institutional audit logs.
  • Extending the framework to include student performance data on the generated items could test whether taxonomy alignment predicts actual learning gains.

Load-bearing premise

Self-rationalization and attribution-based analysis reliably capture true cognitive alignment with the taxonomies without systematic human validation for every generated item.

What would settle it

Human experts review a random sample of the auto-certified items and find that more than a small percentage show clear misalignment with the cognitive levels predicted by the framework.
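
One way to make this test concrete is to treat the expert review as an estimate of a misalignment rate with an uncertainty interval. The Python sketch below draws a reproducible random sample and applies a Wilson score interval. The tolerance, the sample size, and all function names are illustrative assumptions; the paper summary specifies none of them.

```python
import math
import random


def draw_audit_sample(item_ids, sample_size, seed=0):
    """Draw a reproducible simple random sample of auto-certified item IDs."""
    rng = random.Random(seed)
    return rng.sample(list(item_ids), sample_size)


def wilson_interval(misaligned, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = misaligned / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def audit_verdict(misaligned, n, tolerance):
    """Compare the interval for the misalignment rate with an agreed tolerance."""
    low, high = wilson_interval(misaligned, n)
    if high <= tolerance:
        return "within tolerance"
    if low > tolerance:
        return "exceeds tolerance"
    return "inconclusive"


# Hypothetical audit: 100 of the auto-certified items go to expert reviewers.
sample = draw_audit_sample(range(500), 100)
print(len(sample), audit_verdict(misaligned=3, n=100, tolerance=0.05))
```

Reporting an interval and not a bare percentage guards against declaring the framework safe or unsafe on a sample too small to support either conclusion.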

Figures

Figures reproduced from arXiv: 2604.09622 by Antoun Yaacoub, Anuradha Kar, Zainab Assaghir.

Figure 1. Visual overview of the proposed pipeline: generation …
Figure 2. High-level structure of the proposed certification metadata schema, showing how provenance, …
Figure 3. End-to-end certification workflow for AI-generated assessment items, from generation through …
Original abstract

The rapid adoption of generative artificial intelligence (AI) in educational assessment has created new opportunities for scalable item creation, personalized feedback, and efficient formative evaluation. However, despite advances in taxonomy alignment and automated question generation, the absence of transparent, explainable, and certifiable mechanisms limits institutional and accreditation-level acceptance. This chapter proposes a comprehensive framework for explainability and certification of AI-generated assessment items, combining self-rationalization, attribution-based analysis, and post-hoc verification to produce interpretable cognitive-alignment evidence grounded in Bloom's and SOLO taxonomies. A structured certification metadata schema is introduced to capture provenance, alignment predictions, reviewer actions, and ethical indicators, enabling audit-ready documentation consistent with emerging governance requirements. A traffic-light certification workflow operationalizes these signals by distinguishing auto-certifiable items from those requiring human review or rejection. A proof-of-concept study on 500 AI-generated computer science questions demonstrates the framework's feasibility, showing improved transparency, reduced instructor workload, and enhanced auditability. The chapter concludes by outlining ethical implications, policy considerations, and directions for future research, positioning explainability and certification as essential components of trustworthy, accreditation-ready AI assessment systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a framework for explainability and certification of AI-generated educational assessment items. It combines self-rationalization, attribution-based analysis, and post-hoc verification to produce interpretable cognitive-alignment evidence grounded in Bloom's and SOLO taxonomies. A structured certification metadata schema captures provenance and alignment predictions, while a traffic-light workflow distinguishes auto-certifiable items from those needing human review. Feasibility is shown in a proof-of-concept on 500 AI-generated computer science questions, claiming improved transparency, reduced workload, and enhanced auditability.

Significance. If the alignment predictions prove reliable against human standards, the framework could meaningfully advance institutional adoption of generative AI in assessment by supplying audit-ready documentation aligned with governance requirements, while addressing transparency gaps that currently limit accreditation acceptance.

major comments (2)
  1. [Proof-of-concept study] The study on 500 items claims feasibility, transparency gains, and workload reduction, yet reports no quantitative metrics (e.g., agreement rates between framework predictions and independent expert ratings on Bloom/SOLO levels, precision of alignment classifications, or comparisons against non-AI baselines). This absence leaves the central claim of reliable interpretable cognitive-alignment evidence unanchored and unquantified.
  2. [Framework description; Abstract and § on self-rationalization] The approach assumes that AI self-rationalization plus attribution analysis reliably recovers true cognitive levels without systematic human validation for each item. No external anchor or error analysis is provided, so the traffic-light certification decisions rest on internal consistency that may diverge from educational standards.
minor comments (2)
  1. [Certification workflow] The manuscript would benefit from explicit pseudocode or a detailed diagram of the traffic-light workflow to clarify decision thresholds and reviewer actions.
  2. [Certification metadata schema] Ensure the metadata schema section includes concrete examples of populated fields for provenance and ethical indicators to aid reproducibility.
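
As a rough illustration of the decision logic the referee asks for, the traffic-light routing could be sketched in Python as follows. The only rule stated in this review is that an item is green when its self-rationalization, attribution, and taxonomy-alignment signals all exceed internal thresholds, and is otherwise routed to human review. The threshold values, the rejection rule, the handling of the ethical flag, and all names below are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Light(Enum):
    GREEN = "auto-certify"
    AMBER = "human review"
    RED = "reject"


@dataclass(frozen=True)
class Signals:
    # All three scores are assumed to be normalised to [0, 1].
    rationale_score: float
    attribution_score: float
    alignment_score: float
    ethical_flag: bool = False


@dataclass(frozen=True)
class Thresholds:
    # Placeholder values, not taken from the paper.
    green: float = 0.8
    red: float = 0.3


def classify(s: Signals, t: Thresholds = Thresholds()) -> Light:
    scores = (s.rationale_score, s.attribution_score, s.alignment_score)
    # Assumed rejection rule: any signal below the red threshold.
    if min(scores) < t.red:
        return Light.RED
    # Assumed: an ethical flag always forces human review, never auto-certification.
    if s.ethical_flag:
        return Light.AMBER
    # Green only when every signal exceeds the green threshold.
    if all(score > t.green for score in scores):
        return Light.GREEN
    return Light.AMBER


print(classify(Signals(0.9, 0.9, 0.9)).value)
print(classify(Signals(0.9, 0.5, 0.9)).value)
```

Requiring all signals to clear the green threshold, instead of averaging them, keeps the workflow conservative: one weak signal is enough to send an item to a human.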

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have prepared revisions to strengthen the quantitative grounding and validation aspects of the framework.

  1. Referee: [Proof-of-concept study] The study on 500 items claims feasibility, transparency gains, and workload reduction, yet reports no quantitative metrics (e.g., agreement rates between framework predictions and independent expert ratings on Bloom/SOLO levels, precision of alignment classifications, or comparisons against non-AI baselines). This absence leaves the central claim of reliable interpretable cognitive-alignment evidence unanchored and unquantified.

    Authors: We agree that the proof-of-concept would be strengthened by explicit quantitative metrics. In the revised manuscript we will add agreement rates (Cohen's kappa) between the framework's Bloom/SOLO predictions and independent expert ratings on a stratified subsample of 100 items, precision/recall for alignment classifications, and a comparison against a simple keyword-baseline classifier. These metrics were computed post-submission and will be reported with confidence intervals and error analysis.

    Revision: yes

  2. Referee: [Framework description; Abstract and § on self-rationalization] The approach assumes that AI self-rationalization plus attribution analysis reliably recovers true cognitive levels without systematic human validation for each item. No external anchor or error analysis is provided, so the traffic-light certification decisions rest on internal consistency that may diverge from educational standards.

    Authors: The traffic-light workflow is intentionally conservative: items receive 'green' only when self-rationalization, attribution scores, and taxonomy alignment all exceed internal thresholds; otherwise they are routed to human review. We acknowledge the absence of systematic external validation in the original submission. The revision will include a dedicated error-analysis subsection reporting disagreement cases and preliminary results from a small-scale human validation study (n=50 items) that anchors the framework predictions against expert judgments.

    Revision: partial
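
The agreement metrics promised in the first response can be computed with a few lines of Python. The sketch below implements Cohen's kappa and per-label precision and recall from their standard definitions. The label sequences are toy values invented for illustration, not data from the study, and the function names are not taken from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def precision_recall(predicted, reference, label):
    """Precision and recall for one label, treating `reference` as ground truth."""
    pairs = list(zip(predicted, reference))
    tp = sum(p == label and r == label for p, r in pairs)
    fp = sum(p == label and r != label for p, r in pairs)
    fn = sum(p != label and r == label for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy Bloom-level labels, invented for this example.
framework = ["Remember", "Apply", "Apply", "Analyze", "Apply", "Remember"]
expert = ["Remember", "Apply", "Analyze", "Analyze", "Apply", "Understand"]

kappa = cohens_kappa(framework, expert)
apply_precision, apply_recall = precision_recall(framework, expert, "Apply")
print(round(kappa, 3), round(apply_precision, 3), round(apply_recall, 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because a framework that predicts the most common Bloom level for every item could still match experts often.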

Circularity Check

0 steps flagged

No significant circularity; framework proposal is self-contained


The manuscript proposes a conceptual framework combining self-rationalization, attribution analysis, and certification metadata without any equations, fitted parameters, or derivations that reduce to the paper's own inputs. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to justify core claims; the proof-of-concept on 500 questions is presented as a feasibility demonstration rather than a self-referential prediction. The central claims rest on established external taxonomies (Bloom's, SOLO) and governance requirements, remaining independent of the framework's internal signals.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central proposal rests on the assumption that existing cognitive taxonomies can be used for reliable alignment prediction and on newly introduced constructs whose independent validation is not supplied.

axioms (1)
  • [domain assumption] Bloom's and SOLO taxonomies provide accurate and sufficient representations of cognitive levels for assessment alignment.
    The framework uses these taxonomies as the grounding for alignment predictions and evidence generation.
invented entities (2)
  • Certification metadata schema (no independent evidence)
    Purpose: captures provenance, alignment predictions, reviewer actions, and ethical indicators for auditability.
    New schema introduced by the paper with no prior independent evidence cited.
  • Traffic-light certification workflow (no independent evidence)
    Purpose: operationalizes signals to distinguish auto-certifiable items from those needing human review or rejection.
    New workflow proposed by the paper with no prior independent evidence cited.

