Toward AI-Resilient Assessment in Computer Science Courses in an AI-Native World

Anshumali Shrivastava

arxiv: 2606.30655 · v1 · pith:PVPDZIKWnew · submitted 2026-06-16 · 💻 cs.CY · cs.AI

Pith reviewed 2026-07-01 07:26 UTC · model grok-4.3

keywords: AI-resilient assessment · Pareto surplus · computer science education · AI-native Pareto frontier · executable evaluator · Bloom filters

The pith

Grading by Pareto surplus over a declared AI baseline certifies submissions that exceed what strong AI systems achieve on the same task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes grading senior computer science work by how far a submission exceeds an AI-native Pareto frontier of achievable tradeoffs. An executable evaluator defines the frontier from strong AI baselines, and students receive credit only for surplus beyond that frontier even if they use AI tools freely during their work. This produces a behavioral certificate of performance not already supplied by the baseline, which the surrounding protocol can then interpret as student skill. A sympathetic reader would care because the method keeps assessment meaningful when AI capabilities are strong and improving. The framework is shown in a Bloom filter assignment that asks whether students can improve on AI-generated implementations.

Core claim

The central claim is that Pareto surplus provides a measurable, protocol-relative certificate that a submitted artifact achieves a tradeoff not already supplied by the declared AI baseline, and grading by this surplus is AI-resilient with respect to that baseline. Interpreting surplus as evidence of student skill requires the surrounding assessment protocol such as design reports or oral checks, but the grading certificate itself is behavioral and executable. The framework extends to complications including self-improving AI loops, budget neutrality, and prompt-based red teaming.

What carries the argument

Pareto surplus, the measurable improvement in tradeoff achieved by the student's artifact relative to the declared AI-native Pareto frontier.
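One way to make this quantity concrete is sketched below. The abstract does not give a formula for surplus, so the definition used here is an assumption: the additive epsilon-style margin by which a submission escapes every point of the declared frontier. All objectives are assumed to be minimized and already normalized to comparable scales. The names `weakly_dominates` and `pareto_surplus` are hypothetical.

```python
from typing import Sequence

# A point is a tuple of objective values. Assumption: every objective is
# minimized and has been normalized so that differences are comparable.
Point = Sequence[float]


def weakly_dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b in every objective."""
    return all(x <= y for x, y in zip(a, b))


def pareto_surplus(submission: Point, frontier: Sequence[Point]) -> float:
    """Illustrative surplus: min over baseline points of the best margin.

    For each baseline point b, take the largest improvement of the
    submission over b in any single objective. The surplus is the smallest
    such value across the frontier. It is positive exactly when no
    baseline point weakly dominates the submission.
    """
    if not frontier:
        raise ValueError("the declared frontier must contain at least one point")
    return min(
        max(b_i - s_i for b_i, s_i in zip(b, submission))
        for b in frontier
    )


if __name__ == "__main__":
    declared = [(0.2, 0.8), (0.5, 0.4)]  # hypothetical normalized baseline points
    print(pareto_surplus((0.3, 0.5), declared))  # positive: beyond the frontier
    print(pareto_surplus((0.6, 0.9), declared))  # negative: dominated
```

Under this assumed definition, a submission that merely reproduces a baseline point scores zero, which matches the idea that credit is given only for tradeoffs the declared baseline does not already supply.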

If this is right

Students may use AI freely during assignment work without the grade depending on private AI budget or intensity of use.
The same surplus rule applies across different tasks once an evaluator and frontier are declared for each task.
Additional protocol elements such as ablations or reproducibility explanations are needed to convert the behavioral certificate into a claim of student skill.
The method can be instantiated for concrete tasks such as approximate membership queries with Bloom filters.
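For readers unfamiliar with the data structure in the last item, a minimal Bloom filter is sketched below. It is not the course's reference implementation or an AI-generated baseline from the paper. The parameter names `num_bits` and `num_hashes`, and the choice of deriving positions by double hashing from a SHA-256 digest, are illustrative assumptions.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Minimal Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        if num_bits <= 0 or num_hashes <= 0:
            raise ValueError("num_bits and num_hashes must be positive")
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # packed bit array

    def _positions(self, item: str) -> Iterator[int]:
        # Double hashing: derive all positions from two 64-bit values
        # taken from one SHA-256 digest (an illustrative choice).
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd stride
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


if __name__ == "__main__":
    bf = BloomFilter(num_bits=1024, num_hashes=4)
    bf.add("alice")
    print("alice" in bf)  # inserted items are always reported present
```

The two constructor parameters are the knobs that trade memory against false-positive rate, which is the kind of tradeoff a Pareto frontier for this task would describe.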

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested by running the same assignment with and without the surplus rule and comparing grade distributions to traditional methods.
Periodic re-computation of the AI frontier would be required as base AI models improve, to keep the surplus measure stable over time.
The framework might apply outside computer science to any domain where an executable evaluator of multi-objective performance can be written.
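The re-computation mentioned in the second item can be sketched as a merge-and-prune step: add newly measured baseline points, then discard anything that is now dominated. This is an illustration of the editorial suggestion, not a procedure from the paper. The function names are hypothetical and all objectives are assumed to be minimized.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, ...]  # assumption: every objective is minimized


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def update_frontier(frontier: Iterable[Point], new_points: Iterable[Point]) -> List[Point]:
    """Merge new baseline measurements and keep only non-dominated points."""
    # dict.fromkeys removes exact duplicates while preserving order.
    candidates = list(dict.fromkeys(tuple(p) for p in [*frontier, *new_points]))
    return [
        p for p in candidates
        if not any(dominates(q, p) for q in candidates)
    ]


if __name__ == "__main__":
    old = [(0.2, 0.8), (0.5, 0.4)]        # hypothetical earlier baseline
    fresh = [(0.1, 0.7), (0.6, 0.9)]      # hypothetical runs with a newer model
    print(update_frontier(old, fresh))
```

Because the frontier can only move outward under this update, a submission graded against a later frontier never receives more surplus than it would have against an earlier one.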

Load-bearing premise

An executable evaluator can be defined such that the declared AI-native Pareto frontier accurately and fairly represents the performance achievable by strong AI systems without the student's own AI usage biasing the frontier construction or evaluation.

What would settle it

An observation that students routinely achieve high surplus by exploiting gaps in the evaluator definition or by biasing the frontier construction rather than by producing genuinely superior artifacts.

Original abstract

AI-native course assessments in senior computer science courses and related fields should grade students by AI-resilient skill: the ability to achieve outcomes beyond a strong AI baseline. Such assessments should allow students to use AI freely, while reducing the extent to which greater private AI budget or more intensive AI use, by itself, becomes a grading advantage. This paper proposes a minimal formal framework for this goal. The framework specifies a real task, an executable evaluator, a declared AI-native Pareto frontier, and a grading rule based on Pareto surplus. The central claim is simple: Pareto surplus provides a measurable, protocol-relative certificate that a submitted artifact achieves a tradeoff not already supplied by the declared AI baseline, and grading by this surplus is AI-resilient with respect to that baseline. Interpreting surplus as evidence of student skill requires the surrounding assessment protocol (for example, design reports, ablations, prompt traces, oral checks, or reproducibility explanations), but the grading certificate itself is behavioral and executable. The framework is then extended to practical complications, including self-improving AI loops, budget neutrality, server-mediated feedback, and prompt-based red teaming. As a concrete instantiation, we describe an AI-resilient approximate-membership assignment centered on Bloom filters for COMP 480/580 at Rice University, designed to test whether students can improve beyond AI-generated implementations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a clean formal proposal for grading by Pareto surplus over a declared AI baseline, but the executability of that baseline is asserted rather than shown.

The core idea here is straightforward: define a task and an executable evaluator, declare an AI-native Pareto frontier, then grade submissions by how much they exceed that frontier. This lets students use AI freely while trying to isolate skill that isn't just more compute or better prompting. The separation between the behavioral certificate and the surrounding protocol (reports, traces, orals) is a useful distinction.

What stands out is the minimal framing. It avoids overclaiming and focuses on making the surplus measurable relative to whatever frontier the instructor declares. The Bloom-filter assignment sketch for the Rice course is a reasonable concrete hook.

The main limitation is that the framework's value hinges on the declared frontier being fixed, complete, and unbiased by the same AI capabilities students can access. The abstract mentions red-teaming and self-improving loops as extensions but does not supply a method to ensure the frontier is closed under student-accessible search. Without that, surplus can be produced by better enumeration rather than student insight. No validation data or worked example of frontier construction appears.

This is aimed at CS instructors who need to redesign assessments rather than ban AI. It is a conceptual piece that deserves referee time because the problem is real and the formalization is new enough to warrant discussion, even if the practical gaps need filling.

Referee Report

2 major / 2 minor

Summary. The paper proposes a minimal formal framework for AI-resilient assessment in senior CS courses. Students may use AI freely, but grading is based on Pareto surplus relative to a declared AI-native Pareto frontier for a given task and executable evaluator. The central claim is that this surplus supplies a protocol-relative, behavioral certificate that an artifact exceeds the declared baseline, rendering the grade AI-resilient with respect to that baseline. The framework is extended to self-improving loops, budget neutrality, and red-teaming; a concrete Bloom-filter approximate-membership assignment for COMP 480/580 is described as an instantiation.

Significance. If an executable, unbiased frontier can be maintained, the approach would let instructors permit unrestricted AI use while still distinguishing student skill from private AI budget or intensity. The proposal is timely for AI-native education and supplies a clean, protocol-relative metric rather than ad-hoc detection rules. Credit is due for the explicit separation of the behavioral certificate from surrounding protocol elements (reports, orals) and for naming the free parameter (the declared frontier) up front.

major comments (2)

[Abstract / central claim] Abstract and central claim paragraph: the assertion that 'grading by this surplus is AI-resilient with respect to that baseline' is load-bearing yet rests on the untested assumption that an executable evaluator exists such that the declared frontier is fixed, complete, and not biased by student-accessible AI configurations. The Bloom-filter instantiation constructs the frontier from the instructor's own AI runs; any student prompt/model pair that discovers an unenumerated point on the true capability surface would register as surplus indistinguishable from genuine skill.
[Extensions to practical complications] Section on practical complications (self-improving AI loops and prompt-based red teaming): the paper notes these extensions but supplies no concrete mechanism ensuring the declared set remains closed under student-accessible search. Without such a mechanism the surplus metric risks becoming tautological with the chosen baseline, exactly as flagged by the circularity concern.

minor comments (2)

[Framework definition] The abstract refers to 'an executable evaluator' without specifying its interface or termination guarantees; a short pseudocode or formal signature in the framework section would clarify executability.
[Concrete instantiation] The Bloom-filter instantiation is described at a high level; adding the precise performance dimensions (false-positive rate vs. memory vs. query time) used for the Pareto surface would make the example reproducible.
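To illustrate what the two minor comments ask for, here is one possible shape for an evaluator interface over the three dimensions named above. It is a sketch under stated assumptions, not the paper's evaluator: the `MembershipStructure` protocol, the `Measurement` record, and the `evaluate` function are all hypothetical names, and wall-clock timing is used as a rough stand-in for query cost.

```python
import time
from dataclasses import dataclass
from typing import Iterable, Protocol


class MembershipStructure(Protocol):
    """Hypothetical interface a submitted artifact would implement."""

    def add(self, item: str) -> None: ...
    def __contains__(self, item: str) -> bool: ...
    def size_in_bits(self) -> int: ...


@dataclass(frozen=True)
class Measurement:
    false_positive_rate: float  # fraction of absent probes reported present
    bits_per_key: float         # memory divided by number of inserted keys
    seconds_per_query: float    # mean wall-clock time per probe (noisy)


def evaluate(
    structure: MembershipStructure,
    inserted: Iterable[str],
    absent: Iterable[str],
) -> Measurement:
    """Insert keys, reject false negatives, then measure the three objectives."""
    keys, probes = list(inserted), list(absent)
    if not keys or not probes:
        raise ValueError("need at least one inserted key and one absent probe")
    for key in keys:
        structure.add(key)
    # Correctness gate: approximate membership must never miss an inserted key.
    if any(key not in structure for key in keys):
        raise ValueError("false negative detected; submission is invalid")
    start = time.perf_counter()
    false_positives = sum(1 for probe in probes if probe in structure)
    elapsed = time.perf_counter() - start
    return Measurement(
        false_positive_rate=false_positives / len(probes),
        bits_per_key=structure.size_in_bits() / len(keys),
        seconds_per_query=elapsed / len(probes),
    )
```

The function terminates after a fixed number of insertions and probes, which speaks to the termination concern, although it does not bound the running time of the submitted structure itself. A production evaluator would likely need sandboxing and time limits, which are outside this sketch.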

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below, clarifying the scope of our minimal framework while agreeing where revisions can strengthen the presentation.

Referee: [Abstract / central claim] Abstract and central claim paragraph: the assertion that 'grading by this surplus is AI-resilient with respect to that baseline' is load-bearing yet rests on the untested assumption that an executable evaluator exists such that the declared frontier is fixed, complete, and not biased by student-accessible AI configurations. The Bloom-filter instantiation constructs the frontier from the instructor's own AI runs; any student prompt/model pair that discovers an unenumerated point on the true capability surface would register as surplus indistinguishable from genuine skill.

Authors: The central claim is that Pareto surplus supplies a protocol-relative behavioral certificate of exceeding the declared baseline, rendering the grade AI-resilient specifically with respect to that baseline by construction. The manuscript already states that interpreting the surplus as evidence of student skill requires the surrounding protocol (reports, orals, etc.). The Bloom-filter example is presented as one concrete instantiation in which the instructor declares the frontier from their own runs; any superior point found by a student is intended to count as surplus. We do not claim the declared frontier equals the complete true capability surface. We will revise the abstract and central claim paragraph to emphasize more explicitly that resilience holds relative to the declared baseline and that frontier quality remains an instructor responsibility. revision: partial
Referee: [Extensions to practical complications] Section on practical complications (self-improving AI loops and prompt-based red teaming): the paper notes these extensions but supplies no concrete mechanism ensuring the declared set remains closed under student-accessible search. Without such a mechanism the surplus metric risks becoming tautological with the chosen baseline, exactly as flagged by the circularity concern.

Authors: The manuscript introduces a minimal formal framework and discusses self-improving loops and red-teaming only at a conceptual level to indicate possible extensions. No concrete closure mechanism is supplied because developing and validating such a mechanism lies outside the paper's scope. The framework treats frontier maintenance as part of the instructor's broader assessment protocol; any circularity risk is therefore managed by that protocol rather than by the surplus metric alone. revision: no

Circularity Check

0 steps flagged

No circularity: framework defines metric relative to declared baseline without reducing predictions to inputs

The paper proposes a definitional framework (real task + executable evaluator + declared Pareto frontier + surplus grading rule) whose central claim is that surplus certifies a tradeoff beyond the declared baseline and is therefore AI-resilient w.r.t. that baseline. This relation holds by the explicit definition of surplus rather than by deriving an independent prediction from data or prior results. No equations, fitted parameters, or self-citations are shown that would make a claimed result equivalent to its inputs by construction. The text explicitly notes that interpreting surplus as skill evidence requires additional protocol elements and extensions for self-improving loops, indicating the framework is offered as a protocol-relative certificate rather than a closed, self-verifying derivation. The provided abstract and context contain no load-bearing self-citation chains, ansatz smuggling, or renaming of known results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the existence of an executable evaluator and the ability to declare an AI baseline frontier without circular dependence on student submissions.

free parameters (1)

Declared AI-native Pareto frontier
The frontier is declared by the instructor; its construction method and update frequency are not specified in the abstract.

axioms (1)

Domain assumption: An executable evaluator exists that can measure multi-objective tradeoffs between AI-generated and student-submitted artifacts.
This is required for the surplus certificate to be behavioral and executable.
