AI to Learn 2.0: A Deliverable-Oriented Governance Framework and Maturity Rubric for Opaque AI in Learning-Intensive Domains

Seine A. Shintani

arXiv: 2604.19751 · v1 · submitted 2026-03-16 · cs.AI · cs.CY

Pith reviewed 2026-05-15 10:28 UTC · model grok-4.3

keywords: AI governance · deliverable-oriented framework · maturity rubric · opaque AI · learning-intensive domains · capability residual · auditability · generative AI

The pith

AI to Learn 2.0 lets opaque AI assist early exploration and drafting but requires the final deliverable to remain fully usable, auditable, and justifiable without the original model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes AI to Learn 2.0, a deliverable-oriented governance framework for generative AI in research, education, and professional work. It addresses proxy failure by shifting focus from the creation process to whether the released artifact still demonstrates human understanding and transfer. The framework permits AI during hypothesis generation and workflow design yet demands that the output package be standalone, with human-attributable evidence of explanation in learning contexts. A five-part deliverable structure combined with a seven-dimension maturity rubric and gate thresholds is used to separate polished AI substitution from bounded, auditable assistance. Worked examples across coursework, symbolic regression, exam forms, and lecture pipelines illustrate the distinction in practice.

Core claim

The central claim is that governance should be reorganized around the final deliverable package, with artifact residual distinguished from capability residual. On that basis, opaque AI may be used in early stages provided the released work is usable, auditable, transferable, and justifiable without the original large language model or cloud API. In learning-intensive domains the framework additionally requires context-appropriate human-attributable evidence of explanation or transfer. These requirements are operationalized through a five-part package, a seven-dimension maturity rubric, gate thresholds, and a capability-evidence ladder.

What carries the argument

The argument is carried by the seven-dimension maturity rubric with gate thresholds on critical dimensions. Together with the five-part deliverable package and the companion capability-evidence ladder, it operationalizes the separation of artifact residual from capability residual.

If this is right

The framework separates polished substitution workflows from bounded, auditable, and handoff-ready AI-assisted workflows in worked scoring across contrastive cases.
It permits opaque AI during exploration, drafting, hypothesis generation, and workflow design while enforcing standalone usability for the released deliverable.
In learning-intensive contexts it adds the requirement for context-appropriate human-attributable evidence of explanation or transfer.
The approach is positioned as a governance instrument for structured third-party review where capability preservation, accountability, and validity boundaries matter.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same deliverable-focused lens could be applied to professional certifications or regulatory submissions outside formal education to enforce accountability.
Longitudinal tracking of rubric scores across repeated tasks might reveal whether repeated AI-assisted patterns erode or preserve measurable transfer skills.
Threshold values on the rubric dimensions could be empirically tuned by collecting scored deliverables from multiple institutions and correlating them with external capability tests.

Load-bearing premise

That a seven-dimension maturity rubric with gate thresholds can reliably separate polished AI substitution from bounded, auditable AI-assisted work across varied learning contexts without empirical calibration or validation data.

What would settle it

The premise would be undermined if independent reviewers, applying the rubric to matched sets of AI-assisted and non-assisted deliverables from the same learning task, produced scores that fail to correlate with independent measures of human explanation ability or transfer performance.
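
The check described above comes down to measuring rank agreement between rubric scores and an external capability measure. The following is a minimal sketch, assuming Spearman's rank correlation as the statistic; the paper does not name one, and the function names and all numbers here are hypothetical placeholders rather than data from the paper.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical placeholder values, NOT data from the paper.
rubric_totals = [12, 18, 25, 9, 21]      # reviewer-assigned rubric totals
transfer_scores = [40, 55, 80, 35, 70]   # independent transfer-test scores
rho = spearman(rubric_totals, transfer_scores)
```

On real data, a coefficient near zero or below would count against the load-bearing premise. A high coefficient would be consistent with it but would not by itself calibrate the gate thresholds.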

Figures

Figures reproduced from arXiv: 2604.19751 by Seine A. Shintani.

**Figure 2.** Descriptive total scores S(w) for the worked examples. The dashed horizontal lines indicate the interpretation bands proposed in Section 6. The annotations remind the reader that gate satisfaction and, where relevant, capability residual remain decisive.

Original abstract

Generative AI is entering research, education, and professional work faster than current governance frameworks can specify how AI-assisted outputs should be judged in learning-intensive settings. The central problem is proxy failure: a polished artifact can be useful while no longer serving as credible evidence of the human understanding, judgment, or transfer ability that the work is supposed to cultivate or certify. This paper proposes AI to Learn 2.0, a deliverable-oriented governance framework for AI-assisted work. Rather than claiming element-wise novelty, it reorganizes adjacent ideas around the final deliverable package, distinguishes artifact residual from capability residual, and operationalizes the result through a five-part package, a seven-dimension maturity rubric, gate thresholds on critical dimensions, and a companion capability-evidence ladder. AI to Learn 2.0 allows opaque AI during exploration, drafting, hypothesis generation, and workflow design, but requires that the released deliverable be usable, auditable, transferable, and justifiable without the original large language model or cloud API. In learning-intensive contexts, it additionally requires context-appropriate human-attributable evidence of explanation or transfer. Worked scoring across contrastive cases, including coursework substitution, a symbolic-regression governance contrast, teacher-audited national-exam practice forms, and a self-hosted lecture-to-quiz pipeline with deterministic quality control, shows how the framework separates polished substitution workflows from bounded, auditable, and handoff-ready AI-assisted workflows. AI to Learn 2.0 is proposed as a governance instrument for structured third-party review where capability preservation, accountability, and validity boundaries matter.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a workable structure for letting AI handle early stages of learning tasks while requiring the final deliverable to stand alone, but the rubric thresholds lack any calibration or reliability checks.

The paper's main move is to reorganize existing governance ideas around the final deliverable rather than the process. It distinguishes artifact residual from capability residual, then builds a five-part package, a seven-dimension maturity rubric, and gate thresholds that allow opaque AI for exploration and drafting but require the released work to be auditable and transferable without the original model. In learning contexts it adds a need for human-attributable evidence of explanation or transfer.

The contrastive cases on coursework substitution, symbolic regression, national-exam forms, and a lecture-to-quiz pipeline make the scoring rules concrete and show how the framework would flag polished substitution versus bounded assistance. That practical focus is the clearest strength; it gives administrators something they can actually apply to policy without banning AI outright.

The soft spot is exactly where the stress-test note lands: the gate thresholds rest on the author's own scoring of four cases with no calibration data, inter-rater statistics, or external validation against learning outcomes. Because the rubric is meant for third-party review, that absence directly limits how much weight the operational claims can carry right now.

This is for people setting AI rules in education or professional training programs who need an instrument more than a new theorem. It is coherent on its own terms and shows clear thinking about the proxy-failure problem, so it deserves peer review to test whether the dimensions and gates hold up under real use and to surface validation steps. I would bring it to a reading group for discussion but would not cite it as an empirical result.

Referee Report

1 major / 1 minor

Summary. The paper proposes AI to Learn 2.0, a deliverable-oriented governance framework for opaque AI use in learning-intensive domains. It distinguishes artifact residual from capability residual, specifies a five-part deliverable package, introduces a seven-dimension maturity rubric with gate thresholds on critical dimensions, and includes a capability-evidence ladder. The framework permits AI during exploration, drafting, and workflow design but requires the released deliverable to be usable, auditable, transferable, and justifiable without the original model or API; in learning contexts it further requires context-appropriate human-attributable evidence of explanation or transfer. The approach is illustrated via worked scoring on four contrastive cases (coursework substitution, symbolic regression, national-exam forms, and a lecture-to-quiz pipeline).

Significance. If the rubric thresholds prove reliable after calibration, the framework could supply a practical, third-party-reviewable instrument for preserving capability development and accountability while allowing AI assistance, directly addressing proxy failure between polished artifacts and credible evidence of human understanding.

major comments (1)

[Abstract (worked scoring across contrastive cases) and the section presenting the seven-dimension rubric and gate thres] The central operational claim—that the seven-dimension maturity rubric with explicit gate thresholds reliably separates polished AI substitution from bounded, auditable, handoff-ready work—is supported only by author-selected and author-scored illustrations on four contrastive cases. No empirical calibration, inter-rater reliability statistics, correlation with external learning-outcome measures, or cross-context validation data are reported, leaving the decision rules untested for the third-party review use case described in the abstract.
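
One of the missing checks named in this comment, inter-rater reliability, could be approached along the following lines. This is a minimal sketch, assuming two reviewers score the same deliverables on a single rubric dimension and that agreement is summarized with quadratic-weighted Cohen's kappa. The paper prescribes no statistic; the 0 to 4 scale, the function name, and the example scores are hypothetical.

```python
def weighted_kappa(rater_a, rater_b, categories):
    """Quadratic-weighted Cohen's kappa for two raters on an ordinal scale.

    `categories` lists the scale points in order. Raises ZeroDivisionError
    when chance-expected disagreement is zero (e.g. both raters constant).
    """
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed joint distribution of (rater A score, rater B score).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1 / n

    # Marginals give the joint distribution expected under independence.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Disagreement weight: 0 on the diagonal, 1 at maximum distance.
        return ((i - j) / (k - 1)) ** 2

    cells = [(i, j) for i in range(k) for j in range(k)]
    disagree_obs = sum(weight(i, j) * observed[i][j] for i, j in cells)
    disagree_exp = sum(weight(i, j) * row[i] * col[j] for i, j in cells)
    return 1.0 - disagree_obs / disagree_exp


# Hypothetical scale and scores, for illustration only.
SCALE = [0, 1, 2, 3, 4]
reviewer_1 = [3, 4, 2, 3, 1, 4]
reviewer_2 = [3, 3, 2, 4, 1, 4]
kappa = weighted_kappa(reviewer_1, reviewer_2, SCALE)
```

Quadratic weights penalize a 1-versus-4 disagreement more than a 3-versus-4 one, which suits ordinal rubric levels. For more than two raters, or for reporting confidence intervals, a dedicated statistics package would be the better choice.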

minor comments (1)

[Abstract] The abstract states that the framework 'reorganizes adjacent ideas' but does not list the specific prior governance or assessment frameworks being reorganized; adding explicit citations in the introduction would improve traceability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for identifying the need to clarify the evidential basis of the proposed rubric. We respond to the major comment below, distinguishing the manuscript's scope as a framework proposal from the empirical validation that would be required for operational deployment.

Referee: [Abstract (worked scoring across contrastive cases) and the section presenting the seven-dimension rubric and gate thres] The central operational claim—that the seven-dimension maturity rubric with explicit gate thresholds reliably separates polished AI substitution from bounded, auditable, handoff-ready work—is supported only by author-selected and author-scored illustrations on four contrastive cases. No empirical calibration, inter-rater reliability statistics, correlation with external learning-outcome measures, or cross-context validation data are reported, leaving the decision rules untested for the third-party review use case described in the abstract.

Authors: We agree that the four contrastive cases constitute author-selected and author-scored illustrations rather than empirical validation. The manuscript frames AI to Learn 2.0 as a conceptual governance framework whose primary contribution is the deliverable-oriented reorganization, the artifact-versus-capability residual distinction, the five-part package, and the operationalization of the seven-dimension rubric with gate thresholds. The worked examples are presented explicitly to demonstrate application and to show how the decision rules would function in practice; they are not offered as statistical evidence of reliability or predictive validity. We acknowledge that third-party use would require subsequent calibration studies, inter-rater reliability assessment, and correlation with external learning-outcome measures. To address the comment we have added a new Limitations and Future Work section that states the illustrative character of the current cases and specifies the empirical work needed to calibrate thresholds and test generalizability across contexts.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; rubric and framework defined independently

The paper's derivation consists of definitional reorganization of existing governance concepts around the final deliverable package, distinguishing artifact residual from capability residual, and specifying a five-part package plus seven-dimension rubric with explicit gate thresholds. These elements are introduced by direct construction without reference to fitted parameters, self-referential equations, or load-bearing self-citations. The contrastive case illustrations (coursework substitution, symbolic regression, national-exam forms, lecture-to-quiz pipeline) function as application examples rather than the source of the thresholds or the justification for the framework. No uniqueness theorems, ansatzes, or renamings of known results are invoked in a self-referential manner. The central claims therefore remain self-contained and do not reduce to their own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The framework rests on domain assumptions about proxy failure in learning and introduces new constructs to operationalize governance without independent empirical grounding for the rubric thresholds.

free parameters (1)

gate thresholds on critical dimensions
Minimum scores required on key rubric dimensions are proposed as operational parameters without derivation from data or external benchmarks.

axioms (1)

domain assumption: A polished artifact can be useful while no longer serving as credible evidence of the human understanding, judgment, or transfer ability that the work is supposed to cultivate or certify.
This proxy failure premise is stated as the central problem driving the entire framework.

invented entities (3)

artifact residual (no independent evidence)
purpose: To isolate the portion of the deliverable attributable to AI assistance versus human contribution.
New distinction introduced to focus evaluation on what remains after AI use.

capability residual (no independent evidence)
purpose: To capture the human-attributable evidence of understanding and transfer in the final deliverable.
Complements artifact residual to enforce learning goals.

capability-evidence ladder (no independent evidence)
purpose: Structured progression for documenting human capability evidence.
Operational tool paired with the rubric.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

The core AI to Learn deliverable package is D=(A,P,V,F,R)... Gate(M(w))=1 when ... (m1≥3)∧(m2≥3)∧(m4≥3)∧(m5≥3)
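
The gate condition in the quoted passage can be read as a simple conjunction over rubric scores. Below is a minimal sketch that implements only the four conjuncts visible in the quote. The passage is elided, so the paper's full gate may include further conditions that are not reconstructed here. The dictionary layout, the names, and the example score vectors are hypothetical, as is the use of indices 3, 6, and 7 for the remaining dimensions of the seven-dimension rubric.

```python
# Conjuncts visible in the quoted passage: m1, m2, m4, m5 must each be >= 3.
# The quote contains "...", so the full gate in the paper may have more
# conditions; this sketch covers only what is visible.
VISIBLE_GATE_THRESHOLDS = {1: 3, 2: 3, 4: 3, 5: 3}


def gate(scores):
    """Return 1 if every visible gate conjunct holds, else 0.

    `scores` maps a dimension index (1..7) to its rubric score m_i.
    """
    return int(all(scores[i] >= t for i, t in VISIBLE_GATE_THRESHOLDS.items()))


# Hypothetical score vectors, for illustration only.
strong_elsewhere_but_gated = {1: 4, 2: 2, 3: 4, 4: 4, 5: 4, 6: 4, 7: 4}
modest_but_passing = {1: 3, 2: 3, 3: 1, 4: 3, 5: 3, 6: 1, 7: 1}

gate_a = gate(strong_elsewhere_but_gated)  # m2 = 2 falls below its threshold
gate_b = gate(modest_but_passing)          # all four visible conjuncts hold
```

The two vectors show why a gate differs from a total: high scores on other dimensions do not compensate for one critical dimension below its threshold.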

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages

[1]

OECD Publishing, Paris, 2026.https://doi.org/10.1787/062a7394-en

OECD.OECD Digital Education Outlook 2026: Exploring Effective Uses of Generative AI in Education. OECD Publishing, Paris, 2026.https://doi.org/10.1787/062a7394-en

work page doi:10.1787/062a7394-en 2026
[2]

F. Miao, W. Holmes, R. Huang, and H. Zhang.Guidance for Generative AI in Education and Research. UNESCO, 2023.https://unesdoc.unesco.org/ark:/48223/pf0000386693

work page 2023
[3]

UNESCO, 2024.https://unesdoc

UNESCO.AI Competency Framework for Teachers. UNESCO, 2024.https://unesdoc. unesco.org/ark:/48223/pf0000391104

work page 2024
[4]

2023, doi: 10.6028/NIST.AI.100-1

E. Tabassi.Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1, 2023.https://doi.org/10.6028/NIST.AI.100-1

work page doi:10.6028/nist.ai.100-1 2023
[5]

NIST Trustworthy and Responsible AI NIST AI 600-1 Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile,

C. Autio et al.Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. NIST AI 600-1, 2024.https://doi.org/10.6028/NIST.AI.600-1

work page doi:10.6028/nist.ai.600-1 2024
[6]

Created July 8, 2022; updated March 18, 2025; accessed March 2026.https://www.nist.gov/itl/ ai-risk-management-framework/nist-ai-rmf-playbook-faqs

National Institute of Standards and Technology.NIST AI RMF Playbook FAQs. Created July 8, 2022; updated March 18, 2025; accessed March 2026.https://www.nist.gov/itl/ ai-risk-management-framework/nist-ai-rmf-playbook-faqs

work page 2022
[7]

Batool, D

A. Batool, D. Zowghi, and M. Bano. AI governance: A systematic literature review.AI and Ethics, 5:3265–3279, 2025.https://doi.org/10.1007/s43681-024-00653-w. 22

work page doi:10.1007/s43681-024-00653-w 2025
[8]

X. Han, H. Peng, and M. Liu. The impact of GenAI on learning outcomes: A systematic review and meta-analysis of experimental studies.Educational Research Review, 48:100714, 2025.https://doi.org/10.1016/j.edurev.2025.100714

work page doi:10.1016/j.edurev.2025.100714 2025
[9]

Prather et al

J. Prather et al. The widening gap: The benefits and harms of generative AI for novice programmers. InProceedings of the 2024 ACM Conference on International Computing Education Research, pages 469–486, 2024

work page 2024
[10]

J. Zhi, H. Kumar, and M. Lee. Investigating the effects of LLM use on critical thinking under time constraints: Access timing and time availability. arXiv:2603.08849, 2026

work page arXiv 2026
[11]

Dawson, M

P. Dawson, M. Bearman, M. Dollinger, and D. Boud. Validity matters more than cheating. Assessment & Evaluation in Higher Education, 49(7):1005–1016, 2024.https://doi.org/ 10.1080/02602938.2024.2386662

work page doi:10.1080/02602938.2024.2386662 2024
[12]

Perkins, L

M. Perkins, L. Furze, J. Roe, and J. MacVaugh. The Artificial Intelligence Assessment Scale (AIAS): A framework for ethical integration of generative AI in educational assessment. Journal of University Teaching & Learning Practice, 21(6), 2024.https://doi.org/10. 53761/q3azde36

work page 2024
[13]

Perkins, J

M. Perkins, J. Roe, and L. Furze. Reimagining the Artificial Intelligence Assessment Scale: A refined framework for educational assessment.Journal of University Teaching & Learning Practice, 22(7), 2025.https://doi.org/10.53761/rrm4y757

work page doi:10.53761/rrm4y757 2025
[14]

Tregloan and H

K. Tregloan and H. S. Song. From How Much to Whodunnit: A framework for authorising and evaluating student AI use. In T. Cochrane et al. (eds.),Proceedings ASCILITE 2024, pages 255–265, 2024.https://doi.org/10.14742/apubs.2024.1441

work page doi:10.14742/apubs.2024.1441 2024
[15]

J. M. Lodge, S. Howard, M. Bearman, P. Dawson, and Associates.Assess- ment reform for the age of artificial intelligence. Tertiary Education Quality and Standards Agency, Australian Government, 2023. Landing page: https: //www.teqsa.gov.au/guides-resources/resources/corporate-publications/ assessment-reform-age-artificial-intelligence

work page 2023
[16]

Tertiary Education Quality and Standards Agency.Gen AI Strategies for Australian Higher Education: Emerging Practice. 2024. https:// www.teqsa.gov.au/guides-resources/resources/corporate-publications/ gen-ai-strategies-australian-higher-education-emerging-practice

work page 2024
[17]

J. M. Lodge, M. Bearman, P. Dawson, H. Gniel, R. Harper, D. Liu, J. McLean, L. Ucnik, and Associates.Enacting assessment reform in a time of artificial intelligence. Tertiary Education Quality and Standards Agency, Aus- tralian Government, 2025.https://www.teqsa.gov.au/sites/default/files/2025-09/ enacting-assessment-reform-in-a-time-of-artificial-intelli...

work page 2025
[18]

A.-M. Chase. Assessment by design: A classification framework for learning assurance in the age of GenAI. InProceedings ASCILITE 2025, pages 615–624, 2025.https://doi.org/ 10.65106/apubs.2025.2785

work page doi:10.65106/apubs.2025.2785 2025
[19]

Henderson, M

T. Corbin, P. Dawson, and D. Liu. Talk is cheap: Why structural assessment changes are needed for a time of GenAI.Assessment & Evaluation in Higher Education, 50(7):1087–1097, 2025.https://doi.org/10.1080/02602938.2025.2503964

work page doi:10.1080/02602938.2025.2503964 2025
[20]

Why Should I Trust You?

M. T. Ribeiro, S. Singh, and C. Guestrin. “Why should I trust you?”: Explaining the predictions of any classifier. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016.https: //doi.org/10.1145/2939672.2939778. 23

work page doi:10.1145/2939672.2939778 2016
[21]

S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, pages 4765–4774, 2017.https: //dl.acm.org/doi/10.5555/3295222.3295230

work page doi:10.5555/3295222.3295230 2017
[22]

C. Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead.Nature Machine Intelligence, 1(5):206–215, 2019. https://doi.org/10.1038/s42256-019-0048-x

work page doi:10.1038/s42256-019-0048-x 2019
[23]

Clear Sanctions, Vague Rewards: How China’s Social Credit System Currently Defines

M. Mitchell et al. Model cards for model reporting. InProceedings of the 2019 Conference on Fairness, Accountability, and Transparency, pages 220–229, 2019.https://doi.org/10. 1145/3287560.3287596

work page arXiv 2019
[24]

URL https://cacm.acm.org/research/ datasheets-for-datasets/

T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021. https://doi.org/10.1145/3458723

work page doi:10.1145/3458723 2021
[25]

Human-in-the-loop machine learning: a state of the art,

E. Mosqueira-Rey, E. Hernández-Pereira, D. Alonso-Ríos, J. Bobes-Bascarán, and Á. Fernández-Leal. Human-in-the-loop machine learning: a state of the art.Artificial Intelli- gence Review, 56:3005–3054, 2023.https://doi.org/10.1007/s10462-022-10246-w

work page doi:10.1007/s10462-022-10246-w 2023
[26]

Smith, and Oren Etzioni

R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni. Green AI.Communications of the ACM, 63(12):54–63, 2020.https://doi.org/10.1145/3381831

work page doi:10.1145/3381831 2020
[27]

Natarajan, S

S. Natarajan, S. Mathur, S. Sidheekh, W. Stammer, and K. Kersting. Human-in-the-loop or AI-in-the-loop? Automate or collaborate?Proceedings of the AAAI Conference on Artificial Intelligence, 39(27):29473–29480, 2025.https://doi.org/10.1609/aaai.v39i27.35083

work page doi:10.1609/aaai.v39i27.35083 2025
[28]

Vinci, L

V. Vinci, L. S. Agrati, P. Berardi, and A. Beri. Artificial intelligence as a catalyst for transformative assessment: Designing teacher literacy at the crossroads of ethics, pedagogy, and human relationships.Frontiers in Education, 11:1760626, 2026.https://doi.org/10. 3389/feduc.2026.1760626

work page arXiv 2026
[29]

Shintani.AI to Learn (AI2L): Human-Centered Guidelines for Black-Box-Free AI and Empirical Law Discovery via Symbolic Regression

S. Shintani.AI to Learn (AI2L): Human-Centered Guidelines for Black-Box-Free AI and Empirical Law Discovery via Symbolic Regression. Jxiv preprint, 2025.https://doi.org/ 10.51094/jxiv.1435

work page doi:10.51094/jxiv.1435 2025
[30]

S. A. Shintani. Self-hosted Lecture-to-Quiz: Local LLM MCQ generation with deterministic quality control. arXiv:2603.08729, 2026.https://arxiv.org/abs/2603.08729

work page arXiv 2026
[31]

T. K. Koo and M. Y. Li. A guideline of selecting and reporting intraclass correlation coefficients for reliability research.Journal of Chiropractic Medicine, 15(2):155–163, 2016. https://doi.org/10.1016/j.jcm.2016.02.012

work page doi:10.1016/j.jcm.2016.02.012 2016
[32]

Vanbelle et al

S. Vanbelle et al. A comprehensive guide to study the agreement and reliability of ordinal scores in the presence of multiple observers.BMC Medical Research Methodology, 24, 2024. https://doi.org/10.1186/s12874-024-02431-y

work page doi:10.1186/s12874-024-02431-y 2024
[33]

Kangwa, M

D. Kangwa, M. M. Msafiri, and A. Fute. Balancing innovation and ethics: Promote academic integrity through support and effective use of GenAI tools in higher education.AI and Ethics, 5(4):3497–3530, 2025.https://doi.org/10.1007/s43681-025-00689-6

work page doi:10.1007/s43681-025-00689-6 2025
[34]

P. D. N. Ncube, G. P. Dzvapatsva, C. Matobobo, and M. M. Ranga. Redefining student assessment in AI-infused learning environments: A systematic review of challenges and strategies for academic integrity.AI and Ethics, 6:68, 2026.https://doi.org/10.1007/ s43681-025-00871-w. 24

work page 2026
[35]

Coates, G

H. Coates, G. Croucher, and A. Calderon. Governing academic integrity: Ensuring the authenticity of higher thinking in the era of generative artificial intelligence.Journal of Academic Ethics, 23:2015–2028, 2025.https://doi.org/10.1007/s10805-025-09639-7. 25

work page doi:10.1007/s10805-025-09639-7 2015

[1] [1]

OECD Publishing, Paris, 2026.https://doi.org/10.1787/062a7394-en

OECD.OECD Digital Education Outlook 2026: Exploring Effective Uses of Generative AI in Education. OECD Publishing, Paris, 2026.https://doi.org/10.1787/062a7394-en

work page doi:10.1787/062a7394-en 2026

[2] [2]

F. Miao, W. Holmes, R. Huang, and H. Zhang.Guidance for Generative AI in Education and Research. UNESCO, 2023.https://unesdoc.unesco.org/ark:/48223/pf0000386693

work page 2023

[3] [3]

UNESCO, 2024.https://unesdoc

UNESCO.AI Competency Framework for Teachers. UNESCO, 2024.https://unesdoc. unesco.org/ark:/48223/pf0000391104

work page 2024

[4] [4]

2023, doi: 10.6028/NIST.AI.100-1

E. Tabassi.Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1, 2023.https://doi.org/10.6028/NIST.AI.100-1

work page doi:10.6028/nist.ai.100-1 2023

[5] [5]

NIST Trustworthy and Responsible AI NIST AI 600-1 Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile,

C. Autio et al.Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile. NIST AI 600-1, 2024.https://doi.org/10.6028/NIST.AI.600-1

work page doi:10.6028/nist.ai.600-1 2024

[6] [6]

Created July 8, 2022; updated March 18, 2025; accessed March 2026.https://www.nist.gov/itl/ ai-risk-management-framework/nist-ai-rmf-playbook-faqs

National Institute of Standards and Technology.NIST AI RMF Playbook FAQs. Created July 8, 2022; updated March 18, 2025; accessed March 2026.https://www.nist.gov/itl/ ai-risk-management-framework/nist-ai-rmf-playbook-faqs

work page 2022

[7] [7]

Batool, D

A. Batool, D. Zowghi, and M. Bano. AI governance: A systematic literature review.AI and Ethics, 5:3265–3279, 2025.https://doi.org/10.1007/s43681-024-00653-w. 22

work page doi:10.1007/s43681-024-00653-w 2025

[8] [8]

X. Han, H. Peng, and M. Liu. The impact of GenAI on learning outcomes: A systematic review and meta-analysis of experimental studies.Educational Research Review, 48:100714, 2025.https://doi.org/10.1016/j.edurev.2025.100714

work page doi:10.1016/j.edurev.2025.100714 2025

[9] [9]

Prather et al

J. Prather et al. The widening gap: The benefits and harms of generative AI for novice programmers. InProceedings of the 2024 ACM Conference on International Computing Education Research, pages 469–486, 2024

work page 2024

[10] [10]

J. Zhi, H. Kumar, and M. Lee. Investigating the effects of LLM use on critical thinking under time constraints: Access timing and time availability. arXiv:2603.08849, 2026

work page arXiv 2026

[11] [11]

Dawson, M

P. Dawson, M. Bearman, M. Dollinger, and D. Boud. Validity matters more than cheating. Assessment & Evaluation in Higher Education, 49(7):1005–1016, 2024.https://doi.org/ 10.1080/02602938.2024.2386662

work page doi:10.1080/02602938.2024.2386662 2024

[12] [12]

Perkins, L

M. Perkins, L. Furze, J. Roe, and J. MacVaugh. The Artificial Intelligence Assessment Scale (AIAS): A framework for ethical integration of generative AI in educational assessment. Journal of University Teaching & Learning Practice, 21(6), 2024.https://doi.org/10. 53761/q3azde36

work page 2024

[13] [13]

Perkins, J

M. Perkins, J. Roe, and L. Furze. Reimagining the Artificial Intelligence Assessment Scale: A refined framework for educational assessment.Journal of University Teaching & Learning Practice, 22(7), 2025.https://doi.org/10.53761/rrm4y757

work page doi:10.53761/rrm4y757 2025

[14] [14]

Tregloan and H

K. Tregloan and H. S. Song. From How Much to Whodunnit: A framework for authorising and evaluating student AI use. In T. Cochrane et al. (eds.),Proceedings ASCILITE 2024, pages 255–265, 2024.https://doi.org/10.14742/apubs.2024.1441

work page doi:10.14742/apubs.2024.1441 2024

[15] [15]

J. M. Lodge, S. Howard, M. Bearman, P. Dawson, and Associates.Assess- ment reform for the age of artificial intelligence. Tertiary Education Quality and Standards Agency, Australian Government, 2023. Landing page: https: //www.teqsa.gov.au/guides-resources/resources/corporate-publications/ assessment-reform-age-artificial-intelligence

work page 2023

[16] [16]

Tertiary Education Quality and Standards Agency.Gen AI Strategies for Australian Higher Education: Emerging Practice. 2024. https:// www.teqsa.gov.au/guides-resources/resources/corporate-publications/ gen-ai-strategies-australian-higher-education-emerging-practice

work page 2024

[17] [17]

J. M. Lodge, M. Bearman, P. Dawson, H. Gniel, R. Harper, D. Liu, J. McLean, L. Ucnik, and Associates.Enacting assessment reform in a time of artificial intelligence. Tertiary Education Quality and Standards Agency, Aus- tralian Government, 2025.https://www.teqsa.gov.au/sites/default/files/2025-09/ enacting-assessment-reform-in-a-time-of-artificial-intelli...

work page 2025

[18]

A.-M. Chase. Assessment by design: A classification framework for learning assurance in the age of GenAI. In Proceedings ASCILITE 2025, pages 615–624, 2025. https://doi.org/10.65106/apubs.2025.2785

[19]

T. Corbin, P. Dawson, and D. Liu. Talk is cheap: Why structural assessment changes are needed for a time of GenAI. Assessment & Evaluation in Higher Education, 50(7):1087–1097, 2025. https://doi.org/10.1080/02602938.2025.2503964

[20]

M. T. Ribeiro, S. Singh, and C. Guestrin. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1135–1144, 2016. https://doi.org/10.1145/2939672.2939778

[21]

S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems 30, pages 4765–4774, 2017. https://dl.acm.org/doi/10.5555/3295222.3295230

[22]

C. Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019. https://doi.org/10.1038/s42256-019-0048-x

[23]

M. Mitchell et al. Model cards for model reporting. In Proceedings of the 2019 Conference on Fairness, Accountability, and Transparency, pages 220–229, 2019. https://doi.org/10.1145/3287560.3287596

[24]

T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford. Datasheets for datasets. Communications of the ACM, 64(12):86–92, 2021. https://doi.org/10.1145/3458723

[25]

E. Mosqueira-Rey, E. Hernández-Pereira, D. Alonso-Ríos, J. Bobes-Bascarán, and Á. Fernández-Leal. Human-in-the-loop machine learning: a state of the art. Artificial Intelligence Review, 56:3005–3054, 2023. https://doi.org/10.1007/s10462-022-10246-w

[26]

R. Schwartz, J. Dodge, N. A. Smith, and O. Etzioni. Green AI. Communications of the ACM, 63(12):54–63, 2020. https://doi.org/10.1145/3381831

[27]

S. Natarajan, S. Mathur, S. Sidheekh, W. Stammer, and K. Kersting. Human-in-the-loop or AI-in-the-loop? Automate or collaborate? Proceedings of the AAAI Conference on Artificial Intelligence, 39(27):29473–29480, 2025. https://doi.org/10.1609/aaai.v39i27.35083

[28]

V. Vinci, L. S. Agrati, P. Berardi, and A. Beri. Artificial intelligence as a catalyst for transformative assessment: Designing teacher literacy at the crossroads of ethics, pedagogy, and human relationships. Frontiers in Education, 11:1760626, 2026. https://doi.org/10.3389/feduc.2026.1760626

[29]

S. Shintani. AI to Learn (AI2L): Human-Centered Guidelines for Black-Box-Free AI and Empirical Law Discovery via Symbolic Regression. Jxiv preprint, 2025. https://doi.org/10.51094/jxiv.1435

[30]

S. A. Shintani. Self-hosted Lecture-to-Quiz: Local LLM MCQ generation with deterministic quality control. arXiv:2603.08729, 2026. https://arxiv.org/abs/2603.08729

[31]

T. K. Koo and M. Y. Li. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2):155–163, 2016. https://doi.org/10.1016/j.jcm.2016.02.012

[32]

S. Vanbelle et al. A comprehensive guide to study the agreement and reliability of ordinal scores in the presence of multiple observers. BMC Medical Research Methodology, 24, 2024. https://doi.org/10.1186/s12874-024-02431-y

[33]

D. Kangwa, M. M. Msafiri, and A. Fute. Balancing innovation and ethics: Promote academic integrity through support and effective use of GenAI tools in higher education. AI and Ethics, 5(4):3497–3530, 2025. https://doi.org/10.1007/s43681-025-00689-6

[34]

P. D. N. Ncube, G. P. Dzvapatsva, C. Matobobo, and M. M. Ranga. Redefining student assessment in AI-infused learning environments: A systematic review of challenges and strategies for academic integrity. AI and Ethics, 6:68, 2026. https://doi.org/10.1007/s43681-025-00871-w

[35]

H. Coates, G. Croucher, and A. Calderon. Governing academic integrity: Ensuring the authenticity of higher thinking in the era of generative artificial intelligence. Journal of Academic Ethics, 23:2015–2028, 2025. https://doi.org/10.1007/s10805-025-09639-7