The paper recommends shifting CS education from implementation skills to AI-native competencies, fundamental concepts, and critical evaluation based on workshop consensus.
Scalable and Personalized Oral Assessments Using Voice AI
1 Pith paper cite this work. Polarity classification is still indexing.
abstract
Students in our AI/ML course submitted polished, well-argued project analyses. Then, in class discussion, we asked them to walk through a single choice from their own work. Many could not. The writing looked great. The understanding often wasn't. Oral examinations retain an evidentiary link where written work no longer does: a student who can reason aloud, defend a decision under follow-up, and adapt when pushed demonstrates something no submitted document can certify. The obstacle has always been cost. A 25-minute oral reviewed by two graders takes roughly 30 combined instructor and TA hours for 36 students; at 100 the format is untenable. Voice AI and automated grading change the arithmetic. We built Viva, a system that conducts a personalized oral exam, then grades the transcript with a panel of three LLMs that score independently, read each other's assessments, and revise. Across two undergraduate cohorts at NYU Stern (36 students in Fall 2025, 37 in Spring 2026), grading-LLM cost stayed under one dollar per exam within the ElevenLabs subscription covering our voice minutes; for deployments exceeding an equivalent credit pool, budget about a dollar per ten minutes of graded exam time, practical for weekly assignments, not just finals. The system also broke instructively: the agent asked several questions at once, failed to randomize topics across the cohort, and a voice cloned from the professor's came across as harsh, replaced in Spring 2026 with a calm preset. These failures, with an earlier finding that a monolithic agent handling both examination and grading proved unreliable, point to five candidate transferable patterns: decompose into single-purpose modules, constrain behavior with code rather than prompts, keep randomization out of the LLM, grade with a multi-model panel whose members disagree, and choose voice characteristics with the same care as question design.
fields
cs.CY 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Reshaping Undergraduate Computer Science Education in the Generative AI Era
The paper recommends shifting CS education from implementation skills to AI-native competencies, fundamental concepts, and critical evaluation based on workshop consensus.