Scalable and Personalized Oral Assessments Using Voice AI
Pith reviewed 2026-05-21 10:00 UTC · model grok-4.3
The pith
Voice AI plus a multi-LLM panel can run personalized oral exams at under one dollar each while revealing reusable design patterns.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Oral examinations retain an evidentiary link where written work no longer does, yet a 25-minute oral reviewed by two graders takes roughly 30 combined instructor and TA hours for 36 students. Viva separates the examination into a voice-AI module and the grading into a multi-model panel whose members score independently, read each other's assessments, and revise. Across 73 students in two semesters the grading-LLM cost remained under one dollar per exam within the existing ElevenLabs subscription; the system also exposed failures in multi-question phrasing, lack of randomization, and voice tone that led to five transferable patterns for building similar tools.
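The 30-hour figure is consistent with simple multiplication if both graders spend the full 25 minutes on every student. That breakdown is an assumption here, not something the paper states, and it ignores scheduling and preparation time. A minimal sketch of the arithmetic:

```python
def grading_hours(students: int, exam_minutes: float, graders: int) -> float:
    """Combined grader time in hours, counting exam minutes only.

    Assumption: every grader attends or reviews the whole exam.
    """
    return students * exam_minutes * graders / 60


hours_36 = grading_hours(36, 25, 2)    # the cohort size quoted in the claim
hours_100 = grading_hours(100, 25, 2)  # the class size the abstract calls untenable

print(f"36 students: {hours_36:.1f} h; 100 students: {hours_100:.1f} h")
```

Under these assumptions the same format at 100 students would need roughly 83 combined hours, which illustrates why instructor time grows linearly with enrollment.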
What carries the argument
Viva, a decomposed system that uses a voice AI agent to conduct personalized oral exams and a panel of three LLMs to grade the resulting transcripts independently before mutual revision.
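The score, read, revise loop can be sketched in a few lines. This is an illustration only: the grader functions are hypothetical stubs standing in for LLM calls, peers exchange numeric scores rather than written assessments, and averaging the revised scores is an assumption, since the source does not say how the panel's final grade is combined.

```python
from statistics import mean
from typing import Callable, Dict, Optional

# A grader receives the transcript and, in round two, its peers' first scores.
Grader = Callable[[str, Optional[Dict[str, float]]], float]


def panel_grade(transcript: str, graders: Dict[str, Grader]) -> dict:
    # Round 1: each grader scores without seeing any other grader's output.
    first = {name: grade(transcript, None) for name, grade in graders.items()}
    # Round 2: each grader sees the others' first-round scores and may revise.
    revised = {}
    for name, grade in graders.items():
        peers = {n: s for n, s in first.items() if n != name}
        revised[name] = grade(transcript, peers)
    scores = list(revised.values())
    return {
        "first": first,
        "revised": revised,
        "final": mean(scores),                # assumed aggregation rule
        "spread": max(scores) - min(scores),  # remaining disagreement
    }


def make_stub(base: float, pull: float) -> Grader:
    """Hypothetical grader that moves `pull` of the way toward the peer mean."""
    def grade(transcript: str, peers: Optional[Dict[str, float]]) -> float:
        if not peers:
            return base
        return base + pull * (mean(list(peers.values())) - base)
    return grade


graders = {"a": make_stub(70, 0.5), "b": make_stub(80, 0.5), "c": make_stub(90, 0.5)}
result = panel_grade("placeholder transcript", graders)
print(result["revised"], result["final"], result["spread"])
```

Keeping the first-round scores alongside the revised ones makes it possible to inspect how much each grader moved during revision.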
If this is right
- Oral exams can shift from end-of-term events to weekly or bi-weekly checks without linear growth in instructor time.
- Instructors can allocate effort to question design and high-level review rather than conducting every session.
- Constraining AI behavior through code instead of prompts improves consistency across students.
- Keeping randomization outside the LLM prevents the agent from favoring easier or harder topics for some students.
- Voice characteristics must be chosen and tested with the same care given to question content.
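Two of the points above, randomizing outside the LLM and constraining behavior in code, lend themselves to a short sketch. Everything named here (the topic list, the seed, the question-mark heuristic) is hypothetical and not taken from Viva's implementation.

```python
import random

# Hypothetical topic pool; a real course would load its own.
TOPICS = ["data cleaning", "model selection", "evaluation metrics", "deployment"]


def assign_topics(student_ids, topics, per_student=2, seed=2026):
    """Draw each student's topics in ordinary code so the LLM never chooses them."""
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible for audits
    return {sid: rng.sample(topics, per_student) for sid in sorted(student_ids)}


def enforce_single_question(utterance: str) -> str:
    """Keep the agent's turn up to its first question mark.

    A crude proxy for "one question per turn"; a deployed check would
    probably need something more robust than counting '?'.
    """
    if utterance.count("?") <= 1:
        return utterance
    return utterance[: utterance.index("?") + 1]


assignments = assign_topics(["s3", "s1", "s2"], TOPICS)
trimmed = enforce_single_question("Why did you pick k=5? And what about k=10?")
print(assignments)
print(trimmed)
```

The assigned topics would then be passed to the voice agent as fixed inputs, and the guard would run on each agent turn before it is spoken, so neither behavior depends on the model following a prompt.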
Where Pith is reading between the lines
- The modular decomposition pattern could transfer to other voice-based tutoring or interview tools outside formal courses.
- Adding outcome measures such as follow-up exam performance would test whether the oral scores predict later mastery.
- Deployments beyond 100 students would reveal whether total cost continues to scale linearly with enrollment.
- The same multi-panel revision approach might reduce bias in automated feedback for written assignments.
Load-bearing premise
The multi-LLM grading panel produces scores that validly reflect student understanding.
What would settle it
A side-by-side comparison of the multi-LLM panel scores against independent human grader scores on the same set of oral-exam transcripts.
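One way such a comparison could be summarized is with correlation and mean absolute difference between the two score lists. The sketch below uses only the standard library; the score values are made-up placeholders, not data from the paper, and the choice of statistics is an assumption.

```python
from math import sqrt


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    # Spearman's rho is Pearson's r computed on the ranks.
    return pearson(average_ranks(x), average_ranks(y))


def mean_abs_diff(x, y):
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Placeholder scores for five transcripts (illustrative only).
panel_scores = [78, 85, 62, 90, 71]
human_scores = [75, 88, 60, 93, 74]

print(pearson(panel_scores, human_scores))
print(spearman(panel_scores, human_scores))
print(mean_abs_diff(panel_scores, human_scores))
```

High rank correlation alone would not show that the panel is well calibrated, which is why the absolute difference is reported next to it.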
Original abstract
Students in our AI/ML course submitted polished, well-argued project analyses. Then, in class discussion, we asked them to walk through a single choice from their own work. Many could not. The writing looked great. The understanding often wasn't. Oral examinations retain an evidentiary link where written work no longer does: a student who can reason aloud, defend a decision under follow-up, and adapt when pushed demonstrates something no submitted document can certify. The obstacle has always been cost. A 25-minute oral reviewed by two graders takes roughly 30 combined instructor and TA hours for 36 students; at 100 the format is untenable. Voice AI and automated grading change the arithmetic. We built Viva, a system that conducts a personalized oral exam, then grades the transcript with a panel of three LLMs that score independently, read each other's assessments, and revise. Across two undergraduate cohorts at NYU Stern (36 students in Fall 2025, 37 in Spring 2026), grading-LLM cost stayed under one dollar per exam within the ElevenLabs subscription covering our voice minutes; for deployments exceeding an equivalent credit pool, budget about a dollar per ten minutes of graded exam time, practical for weekly assignments, not just finals. The system also broke instructively: the agent asked several questions at once, failed to randomize topics across the cohort, and a voice cloned from the professor's came across as harsh, replaced in Spring 2026 with a calm preset. These failures, with an earlier finding that a monolithic agent handling both examination and grading proved unreliable, point to five candidate transferable patterns: decompose into single-purpose modules, constrain behavior with code rather than prompts, keep randomization out of the LLM, grade with a multi-model panel whose members disagree, and choose voice characteristics with the same care as question design.
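The abstract's rule of thumb for deployments beyond the credit pool can be turned into a rough budget. The sketch below reads "about a dollar per ten minutes of graded exam time" literally and applies it to the 25-minute exam length mentioned above; the function name and default rate are illustrative, and actual pricing may differ.

```python
def exam_budget_usd(exam_minutes: float, students: int,
                    usd_per_10_min: float = 1.0) -> float:
    """Rough budget under the 'about a dollar per ten minutes' rule of thumb."""
    return students * exam_minutes / 10 * usd_per_10_min


per_exam = exam_budget_usd(25, 1)    # one 25-minute exam
cohort = exam_budget_usd(25, 100)    # 100 students, one exam each

print(f"per exam: ${per_exam:.2f}; cohort of 100: ${cohort:.2f}")
```

Shorter weekly checks would scale this down proportionally, which is the case the abstract has in mind when it calls the cost practical for weekly assignments.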
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Viva, a Voice AI system for conducting personalized oral exams in an AI/ML course followed by automated grading of transcripts via a panel of three LLMs that score independently, review each other, and revise. Deployed with two undergraduate cohorts at NYU Stern (36 students Fall 2025, 37 Spring 2026), the system kept LLM grading costs under one dollar per exam. From observed failures (multi-question prompts, non-randomized topics, harsh cloned voice), the authors derive five transferable patterns: decompose into single-purpose modules, constrain behavior with code, keep randomization out of the LLM, grade with a multi-model panel whose members disagree, and choose voice characteristics carefully.
Significance. If the multi-LLM panel produces valid scores, the work offers a concrete, low-cost path to scalable oral assessment that could restore evidentiary value lost in written submissions. The explicit cost figures and systematic extraction of design patterns from real deployments provide immediately usable guidance for educators. The contribution is practical and timely for AI-assisted education, but it remains conditional on a validation of grading quality that the manuscript does not report.
major comments (2)
- [Abstract] Abstract: the claim that 'grade with a multi-model panel whose members disagree' constitutes a transferable pattern is load-bearing for the central contribution, yet the manuscript supplies no quantitative support. It describes only that the LLMs 'score independently, read each other's assessments, and revise' without reporting inter-rater reliability (Cohen's kappa), Pearson/Spearman correlation with human graders on the same transcripts, or alignment with learning-outcome measures. This leaves the pattern resting on the untested assumption that multi-model disagreement plus revision improves validity over single-model or human grading.
- [Deployment and Results] Deployment description: while cost data are concrete (under $1 per exam within the ElevenLabs subscription), no table or section presents grading accuracy, inter-LLM agreement statistics, student performance deltas, or comparison against traditional oral exams. Without such evidence the assertion that the system 'retains an evidentiary link' to understanding cannot be evaluated and the generalizability of the five patterns is undercut.
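For reference, the agreement statistic named in the first comment can be computed as follows. This is a sketch for categorical labels; the letter grades are invented for illustration, and numeric rubric scores would need binning or a weighted variant before this applies.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented letter grades for six transcripts (illustrative only).
grader_1 = ["A", "B", "B", "C", "A", "B"]
grader_2 = ["A", "B", "C", "C", "A", "A"]

kappa = cohens_kappa(grader_1, grader_2)
print(round(kappa, 2))
```

Kappa compares two raters at a time; for a three-model panel plus human graders, a multi-rater statistic would be the more natural fit.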
minor comments (2)
- [Abstract] Abstract: the clause 'at 100 the format is untenable' is unclear; specify whether this refers to 100 students, 100 exams, or another quantity.
- The five patterns are listed in the abstract but would benefit from a short table in the main text that maps each pattern to the specific failure observed and the code-level or configuration change made.
Simulated Author's Rebuttal
We thank the referee for these focused comments on validation and evidence. Our manuscript reports a practical deployment of Viva and extracts design patterns from observed failures rather than presenting a controlled validation study of grading accuracy. We respond to each point below and have revised the manuscript to clarify the observational basis of our claims, add a limitations section, and moderate language on generalizability.
Point-by-point responses
-
Referee: [Abstract] Abstract: the claim that 'grade with a multi-model panel whose members disagree' constitutes a transferable pattern is load-bearing for the central contribution, yet the manuscript supplies no quantitative support. It describes only that the LLMs 'score independently, read each other's assessments, and revise' without reporting inter-rater reliability (Cohen's kappa), Pearson/Spearman correlation with human graders on the same transcripts, or alignment with learning-outcome measures. This leaves the pattern resting on the untested assumption that multi-model disagreement plus revision improves validity over single-model or human grading.
Authors: We agree that no quantitative metrics such as Cohen's kappa, correlations with human graders, or alignment with learning outcomes are reported. The multi-model panel was introduced after a single-LLM grader produced inconsistent results in early internal tests; the pattern is offered as a candidate insight drawn from that deployment experience, not as a statistically validated improvement. We have revised the abstract to describe the patterns as 'observed' rather than asserted as transferable best practices and have added explicit language in the discussion calling for future comparative validation studies.
Revision: partial
-
Referee: [Deployment and Results] Deployment description: while cost data are concrete (under $1 per exam within the ElevenLabs subscription), no table or section presents grading accuracy, inter-LLM agreement statistics, student performance deltas, or comparison against traditional oral exams. Without such evidence the assertion that the system 'retains an evidentiary link' to understanding cannot be evaluated and the generalizability of the five patterns is undercut.
Authors: The paper centers on system design, per-exam costs, and lessons extracted from concrete failures across two small cohorts. We did not collect parallel human grades or run a controlled comparison against traditional oral exams, so no accuracy tables or performance deltas appear. The phrase 'retains an evidentiary link' refers to the inherent properties of oral assessment rather than new empirical results from Viva. We have inserted a limitations paragraph that directly acknowledges the absence of these statistics and have softened claims about the generalizability of the five patterns.
Revision: yes
- We do not possess human-graded transcripts or inter-rater reliability data from the described deployments, as the study was not designed to include such comparisons.
Circularity Check
No circularity; observational patterns from deployed system are self-contained
Full rationale
The paper describes the Viva system implementation, reports per-exam costs under one dollar across two cohorts, catalogs specific failure modes (multi-question prompts, non-randomized topics, harsh voice), and lists five candidate patterns drawn directly from those observations plus one prior internal trial. No equations, fitted parameters, predictions, uniqueness theorems, or ansatzes appear. The central claims rest on empirical deployment logs rather than any reduction to inputs by construction or self-citation chains. This is a standard case-study report whose validity can be assessed against external replication or human-grader agreement data, none of which is required for the circularity check.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM panels can produce valid oral-exam grades from transcripts without additional human calibration data
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
We implemented a “council of LLMs” (three models from different families that independently score the same transcript, then revise after seeing each other’s reasoning) using Claude, Gemini, and GPT-5... After deliberation, α=0.86 (dimension-level)
- IndisputableMonolith/Foundation/BranchSelection.lean: branch_selection (unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
The recurring lesson is that behavioral constraints on LLMs must be enforced through architecture, not prompting alone... five candidate transferable patterns: decompose into single-purpose modules, constrain behavior with code rather than prompts...
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.