
arxiv: 2603.18221 · v2 · pith:3BIFNLBZnew · submitted 2026-03-18 · 💻 cs.CY

Scalable and Personalized Oral Assessments Using Voice AI

Pith reviewed 2026-05-21 10:00 UTC · model grok-4.3

classification 💻 cs.CY
keywords: oral exams, voice AI, LLM grading, educational technology, scalable assessment, AI agents in education, personalized exams

The pith

Voice AI plus a multi-LLM panel can run personalized oral exams at under one dollar each while revealing reusable design patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that oral exams, which preserve a direct link between student reasoning and demonstrated understanding, can be made practical for classes of dozens or more by automating both the live questioning and the subsequent grading. Written work often looks polished while hiding shallow comprehension; live follow-up questions expose this gap but have historically required too many instructor hours to scale. Viva conducts the oral session through a voice agent and grades the transcript with three independent LLMs that score, read one another's assessments, and revise. In two NYU Stern cohorts the variable LLM cost stayed below one dollar per exam, and the implementation surfaced five concrete patterns for keeping such systems reliable.

Core claim

Oral examinations retain an evidentiary link where written work no longer does, yet a 25-minute oral reviewed by two graders takes roughly 30 combined instructor and TA hours for 36 students. Viva separates the examination into a voice-AI module and the grading into a multi-model panel whose members score independently, read each other's assessments, and revise. Across 73 students in two semesters the grading-LLM cost remained under one dollar per exam within the existing ElevenLabs subscription; the system also exposed failures in multi-question phrasing, lack of randomization, and voice tone that led to five transferable patterns for building similar tools.
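
The 30-hour figure can be reproduced with simple arithmetic. The sketch below assumes the estimate counts only exam minutes multiplied by the number of graders, with no scheduling or preparation overhead; the function name `grader_hours` is ours, not the paper's.

```python
def grader_hours(students: int, exam_minutes: int = 25, graders: int = 2) -> float:
    """Combined grader hours for one round of oral exams.

    Assumption: every grader sits through (or reviews) the full exam,
    and no overhead beyond exam time is counted.
    """
    return students * exam_minutes * graders / 60


hours_36 = grader_hours(36)    # the cohort size quoted in the text
hours_100 = grader_hours(100)  # the class size the abstract calls untenable

print(hours_36)             # 30.0
print(round(hours_100, 1))  # 83.3
```

Under the same assumption, a class of 100 would need a little over 83 combined hours per exam round, which is one way to read the abstract's remark that the format does not scale.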

What carries the argument

Viva, a decomposed system that uses a voice AI agent to conduct personalized oral exams and a panel of three LLMs to grade the resulting transcripts independently before mutual revision.
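
The two-round grading flow can be sketched as follows. This is a minimal illustration, not Viva's implementation: the graders are stub functions standing in for real LLM calls, the 0 to 4 scale is borrowed from the Figure 2 residue, and the final score is a plain mean, whereas the Figure 1 caption describes a chair synthesis step whose details are not given here.

```python
from statistics import mean
from typing import Callable, Dict, Optional

# A grader takes the transcript plus (optionally) its peers' first-round
# scores and returns a score. Hypothetical interface for illustration only.
Grader = Callable[[str, Optional[Dict[str, float]]], float]


def make_stub_grader(initial: float, pull: float = 0.5) -> Grader:
    """Stub that returns `initial` alone, then moves toward the peer mean."""
    def grade(transcript: str, peers: Optional[Dict[str, float]]) -> float:
        if peers is None:
            return initial
        peer_mean = mean(list(peers.values()))
        return initial + pull * (peer_mean - initial)
    return grade


def run_panel(transcript: str, graders: Dict[str, Grader]) -> Dict[str, object]:
    # Round 1: every grader scores independently, seeing no peer output.
    first = {name: g(transcript, None) for name, g in graders.items()}
    # Round 2: every grader sees the other graders' round-1 scores and revises.
    revised = {}
    for name, g in graders.items():
        peers = {n: s for n, s in first.items() if n != name}
        revised[name] = g(transcript, peers)
    # Placeholder aggregation; the paper's chair synthesis may differ.
    final = mean(list(revised.values()))
    return {"first_round": first, "revised": revised, "final": final}


panel = {
    "model_a": make_stub_grader(3.0),
    "model_b": make_stub_grader(2.0),
    "model_c": make_stub_grader(4.0),
}
result = run_panel("(transcript text)", panel)
print(result["revised"])
print(result["final"])
```

Keeping round 1 blind to peer output is what makes the first-round scores independent; the revision round only ever reads the frozen round-1 values, so the order in which graders are called does not matter.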

If this is right

  • Oral exams can shift from end-of-term events to weekly or bi-weekly checks without linear growth in instructor time.
  • Instructors can allocate effort to question design and high-level review rather than conducting every session.
  • Constraining AI behavior through code instead of prompts improves consistency across students.
  • Keeping randomization outside the LLM prevents the agent from favoring easier or harder topics for some students.
  • Voice characteristics must be chosen and tested with the same care given to question content.
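
Keeping randomization outside the LLM can be as simple as drawing topics in ordinary code and handing the agent the result as fixed input. The sketch below is one possible approach under our own assumptions: the topic names, the identifiers, and the function `assign_topics` are hypothetical, and the paper does not state how Viva performs the draw.

```python
import hashlib
import random
from typing import List

# Hypothetical topic pool; replace with the course's real topics.
TOPICS = ["topic_a", "topic_b", "topic_c", "topic_d", "topic_e"]


def assign_topics(student_id: str, exam_id: str, topics: List[str], k: int) -> List[str]:
    """Pick k distinct topics for one student, reproducibly.

    The seed is derived from the exam and student identifiers, so the same
    student always gets the same draw for a given exam (useful for audits),
    while different students get different draws.
    """
    digest = hashlib.sha256(f"{exam_id}:{student_id}".encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")
    rng = random.Random(seed)
    return rng.sample(topics, k)


draw = assign_topics("student-001", "exam-week-3", TOPICS, k=2)
print(draw)
```

Because the draw happens before the conversation starts, the agent never decides which topics a student sees; it only receives the chosen topics, which removes one avenue for it to drift toward easier or harder material.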

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The modular decomposition pattern could transfer to other voice-based tutoring or interview tools outside formal courses.
  • Adding outcome measures such as follow-up exam performance would test whether the oral scores predict later mastery.
  • Deployments beyond 100 students would reveal whether the current per-exam cost scaling continues linearly.
  • The same multi-panel revision approach might reduce bias in automated feedback for written assignments.

Load-bearing premise

The multi-LLM grading panel produces scores that validly reflect student understanding.

What would settle it

A side-by-side comparison of the multi-LLM panel scores against independent human grader scores on the same set of oral-exam transcripts.
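
A first pass at such a comparison could report how far panel scores sit from human scores on the same transcripts. The sketch below uses made-up toy scores purely to show the computation; none of the numbers come from the paper, and the function name and the one-point tolerance are our assumptions. A fuller study would add a chance-corrected statistic on top of these raw rates.

```python
from typing import Dict, Sequence


def agreement_summary(panel: Sequence[float], human: Sequence[float],
                      tolerance: float = 1.0) -> Dict[str, float]:
    """Raw agreement between two score lists for the same transcripts."""
    if len(panel) != len(human) or len(panel) == 0:
        raise ValueError("score lists must be non-empty and the same length")
    diffs = [abs(p - h) for p, h in zip(panel, human)]
    n = len(diffs)
    return {
        "mean_abs_diff": sum(diffs) / n,
        "exact_rate": sum(d == 0 for d in diffs) / n,
        "within_tolerance_rate": sum(d <= tolerance for d in diffs) / n,
    }


# Toy data for illustration only (0-4 scale), not results from the paper.
panel_scores = [3, 2, 4, 1, 3]
human_scores = [3, 3, 4, 3, 2]

summary = agreement_summary(panel_scores, human_scores)
print(summary)
```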

Figures

Figures reproduced from arXiv: 2603.18221 by Konstantinos Rizakos, Panos Ipeirotis.

Figure 1: End-to-end system architecture. Left: The examination pipeline decomposes each oral exam into three agent phases, with per-student context injected via dynamic variables. Right: The grading pipeline scores each transcript through two rounds of multi-model assessment followed by chair synthesis. Cases flagged for high disagreement are routed to human audit.
Figure 2: Dimension-level grading agreement after deliberation (180 assessments). Only 2 of 180 dimension-level grades showed disagreement of 2 or more points.
[Plot residue from an adjacent panel, "Performance by Topic": mean score on a 0–4 scale across the topics Problem Framing, Metrics & Econ., Risk & Ethics, Experiment., and Commun.; values in extraction order: 3.39, 3.03, 2.92, 1.94, 2.81.]
Figure 4: Examination duration vs. overall score. The…
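
The Figure 1 caption mentions routing high-disagreement cases to human audit, and the Figure 2 caption counts disagreements of 2 or more points. A minimal sketch of such a check follows; using a 2-point spread as the audit trigger is our assumption, as are the function name and the dimension labels.

```python
from typing import Dict, List, Sequence


def flag_for_audit(scores_by_dimension: Dict[str, Sequence[float]],
                   threshold: float = 2.0) -> List[str]:
    """Return the dimensions where grader scores spread by >= threshold."""
    flagged = []
    for dimension, scores in scores_by_dimension.items():
        spread = max(scores) - min(scores)
        if spread >= threshold:
            flagged.append(dimension)
    return flagged


# Illustrative scores from three graders per dimension (not paper data).
example = {
    "dimension_a": [3, 3, 4],  # spread 1, below threshold
    "dimension_b": [1, 3, 2],  # spread 2, flagged
    "dimension_c": [2, 2, 2],  # spread 0
}
flagged = flag_for_audit(example)
print(flagged)  # ['dimension_b']
```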

Original abstract

Students in our AI/ML course submitted polished, well-argued project analyses. Then, in class discussion, we asked them to walk through a single choice from their own work. Many could not. The writing looked great. The understanding often wasn't. Oral examinations retain an evidentiary link where written work no longer does: a student who can reason aloud, defend a decision under follow-up, and adapt when pushed demonstrates something no submitted document can certify. The obstacle has always been cost. A 25-minute oral reviewed by two graders takes roughly 30 combined instructor and TA hours for 36 students; at 100 the format is untenable. Voice AI and automated grading change the arithmetic. We built Viva, a system that conducts a personalized oral exam, then grades the transcript with a panel of three LLMs that score independently, read each other's assessments, and revise. Across two undergraduate cohorts at NYU Stern (36 students in Fall 2025, 37 in Spring 2026), grading-LLM cost stayed under one dollar per exam within the ElevenLabs subscription covering our voice minutes; for deployments exceeding an equivalent credit pool, budget about a dollar per ten minutes of graded exam time, practical for weekly assignments, not just finals. The system also broke instructively: the agent asked several questions at once, failed to randomize topics across the cohort, and a voice cloned from the professor's came across as harsh, replaced in Spring 2026 with a calm preset. These failures, with an earlier finding that a monolithic agent handling both examination and grading proved unreliable, point to five candidate transferable patterns: decompose into single-purpose modules, constrain behavior with code rather than prompts, keep randomization out of the LLM, grade with a multi-model panel whose members disagree, and choose voice characteristics with the same care as question design.
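
The abstract's planning figure of about a dollar per ten minutes of graded exam time translates into a simple linear budget. The sketch below applies it to a 25-minute exam, a length borrowed from the human-graded example in the abstract; actual Viva sessions may run shorter or longer, and the function name is ours.

```python
def overflow_budget_usd(students: int, exam_minutes: float,
                        usd_per_ten_minutes: float = 1.0) -> float:
    """Rough budget once usage exceeds the subscription's credit pool.

    Assumption: cost scales linearly with graded exam minutes at the
    abstract's rule of thumb of about one dollar per ten minutes.
    """
    return students * exam_minutes / 10 * usd_per_ten_minutes


per_exam = overflow_budget_usd(1, 25)
per_cohort = overflow_budget_usd(36, 25)

print(per_exam)    # 2.5
print(per_cohort)  # 90.0
```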

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Viva, a Voice AI system for conducting personalized oral exams in an AI/ML course followed by automated grading of transcripts via a panel of three LLMs that score independently, review each other, and revise. Deployed with two undergraduate cohorts at NYU Stern (36 students Fall 2025, 37 Spring 2026), the system kept LLM grading costs under one dollar per exam. From observed failures (multi-question prompts, non-randomized topics, harsh cloned voice), the authors derive five transferable patterns: decompose into single-purpose modules, constrain behavior with code, keep randomization out of the LLM, grade with a multi-model panel whose members disagree, and choose voice characteristics carefully.

Significance. If the multi-LLM panel produces valid scores, the work offers a concrete, low-cost path to scalable oral assessment that could restore evidentiary value lost in written submissions. The explicit cost figures and systematic extraction of design patterns from real deployments provide immediately usable guidance for educators. The contribution is practical and timely for AI-assisted education but remains conditional on unshown validation of grading quality.

major comments (2)
  1. [Abstract] Abstract: the claim that 'grade with a multi-model panel whose members disagree' constitutes a transferable pattern is load-bearing for the central contribution, yet the manuscript supplies no quantitative support. It describes only that the LLMs 'score independently, read each other's assessments, and revise' without reporting inter-rater reliability (Cohen's kappa), Pearson/Spearman correlation with human graders on the same transcripts, or alignment with learning-outcome measures. This leaves the pattern resting on the untested assumption that multi-model disagreement plus revision improves validity over single-model or human grading.
  2. [Deployment and Results] Deployment description: while cost data are concrete (under $1 per exam within the ElevenLabs subscription), no table or section presents grading accuracy, inter-LLM agreement statistics, student performance deltas, or comparison against traditional oral exams. Without such evidence the assertion that the system 'retains an evidentiary link' to understanding cannot be evaluated and the generalizability of the five patterns is undercut.
minor comments (2)
  1. [Abstract] Abstract: the clause 'at 100 the format is untenable' is unclear; specify whether this refers to 100 students, 100 exams, or another quantity.
  2. The five patterns are listed in the abstract but would benefit from a short table in the main text that maps each pattern to the specific failure observed and the code-level or configuration change made.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for these focused comments on validation and evidence. Our manuscript reports a practical deployment of Viva and extracts design patterns from observed failures rather than presenting a controlled validation study of grading accuracy. We respond to each point below and have revised the manuscript to clarify the observational basis of our claims, add a limitations section, and moderate language on generalizability.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'grade with a multi-model panel whose members disagree' constitutes a transferable pattern is load-bearing for the central contribution, yet the manuscript supplies no quantitative support. It describes only that the LLMs 'score independently, read each other's assessments, and revise' without reporting inter-rater reliability (Cohen's kappa), Pearson/Spearman correlation with human graders on the same transcripts, or alignment with learning-outcome measures. This leaves the pattern resting on the untested assumption that multi-model disagreement plus revision improves validity over single-model or human grading.

    Authors: We agree that no quantitative metrics such as Cohen's kappa, correlations with human graders, or alignment with learning outcomes are reported. The multi-model panel was introduced after a single-LLM grader produced inconsistent results in early internal tests; the pattern is offered as a candidate insight drawn from that deployment experience, not as a statistically validated improvement. We have revised the abstract to describe the patterns as 'observed' rather than asserted as transferable best practices and have added explicit language in the discussion calling for future comparative validation studies.

    Revision: partial

  2. Referee: [Deployment and Results] Deployment description: while cost data are concrete (under $1 per exam within the ElevenLabs subscription), no table or section presents grading accuracy, inter-LLM agreement statistics, student performance deltas, or comparison against traditional oral exams. Without such evidence the assertion that the system 'retains an evidentiary link' to understanding cannot be evaluated and the generalizability of the five patterns is undercut.

    Authors: The paper centers on system design, per-exam costs, and lessons extracted from concrete failures across two small cohorts. We did not collect parallel human grades or run a controlled comparison against traditional oral exams, so no accuracy tables or performance deltas appear. The phrase 'retains an evidentiary link' refers to the inherent properties of oral assessment rather than new empirical results from Viva. We have inserted a limitations paragraph that directly acknowledges the absence of these statistics and have softened claims about the generalizability of the five patterns.

    Revision: yes

standing simulated objections not resolved
  • We do not possess human-graded transcripts or inter-rater reliability data from the described deployments, as the study was not designed to include such comparisons.

Circularity Check

0 steps flagged

No circularity; observational patterns from deployed system are self-contained

full rationale

The paper describes the Viva system implementation, reports per-exam costs under one dollar across two cohorts, catalogs specific failure modes (multi-question prompts, non-randomized topics, harsh voice), and lists five candidate patterns drawn directly from those observations plus one prior internal trial. No equations, fitted parameters, predictions, uniqueness theorems, or ansatzes appear. The central claims rest on empirical deployment logs rather than any reduction to inputs by construction or self-citation chains. This is a standard case-study report whose validity can be assessed against external replication or human-grader agreement data, none of which is required for the circularity check.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested premise that LLM transcript grading can substitute for human judgment at acceptable accuracy. That premise is the single axiom listed below; no free parameters or invented entities are introduced beyond standard LLM usage.

axioms (1)
  • domain assumption: LLM panels can produce valid oral-exam grades from transcripts without additional human calibration data
    Invoked implicitly when the authors treat the LLM scores as the operational grading mechanism.

pith-pipeline@v0.9.0 · 5861 in / 1267 out tokens · 33878 ms · 2026-05-21T10:00:43.860469+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    We implemented a “council of LLMs” (three models from different families that independently score the same transcript, then revise after seeing each other’s reasoning) using Claude, Gemini, and GPT-5... After deliberation, α=0.86 (dimension-level)

  • IndisputableMonolith/Foundation/BranchSelection.lean branch_selection unclear

    Relation between the paper passage and the cited Recognition theorem.

    The recurring lesson is that behavioral constraints on LLMs must be enforced through architecture, not prompting alone... five candidate transferable patterns: decompose into single-purpose modules, constrain behavior with code rather than prompts...
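
One way to enforce a behavioral constraint in code rather than in the prompt is to check each agent turn before it is spoken. The sketch below targets the multi-question failure described in the abstract. It is a deliberately crude heuristic that counts question marks; the function name is hypothetical, and the paper does not say that Viva uses this mechanism.

```python
def first_question_only(utterance: str) -> str:
    """Trim an agent turn to its first question if it contains several.

    Heuristic: a question is anything ending in '?'. Turns with zero or
    one question mark pass through unchanged.
    """
    if utterance.count("?") <= 1:
        return utterance
    return utterance[: utterance.index("?") + 1]


turn = "Why did you choose that metric? And how would you validate it? Any risks?"
print(first_question_only(turn))  # Why did you choose that metric?
```

A production guard would more likely reject the turn and ask the model to regenerate it, but even this trimming version shows the design choice: the rule holds regardless of how the model interprets its prompt.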

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 2 internal anchors

  1. [1]

    Tiffany Bayley, Kyle D. S. Maclean, and Tessa Weidner. 2024. Back to the Future: Implementing Large-Scale Oral Exams. Management Teaching Review 11, 1 (2024), 159–170. doi:10.1177/23792981241267744

  2. [2]

    Paul Denny, James Prather, Brett A. Becker, James Finnie-Ansley, Arto Hellas, Juho Leinonen, Andrew Luxton-Reilly, Brent N. Reeves, Eddie Antonio Santos, and Sami Sarsa. 2024. Computing Education in the Era of Generative AI. Commun. ACM 67, 2 (Jan. 2024), 56–67. doi:10.1145/3624720

  3. [3]

    ElevenLabs. 2024. Conversational AI Platform. https://elevenlabs.io/docs/conversational-ai/overview. Accessed January 2025

  4. [4]

    Andrea Fenton. 2025. Reconsidering the Use of Oral Exams and Assessments: An Old Way to Move Into a New Future. Educational Researcher 54, 7 (2025), 430–436. doi:10.3102/0013189X251333638

  5. [5]

    Catherine Hartmann. 2025. Oral Exams for a Generative AI World: Managing Concerns and Logistics for Undergraduate Humanities Instruction. College Teaching (2025). doi:10.1080/87567555.2025.2558563

  6. [6]

    Brian Jabarian and Léa Henkel. 2025. Voice AI in Firms: A Natural Field Experiment on Automated Job Interviews. doi:10.2139/ssrn.5395709 SSRN 5395709

  7. [7]

    Yang Jiang, Jiangang Hao, Michael Fauss, and Chen Li. 2024. Detecting ChatGPT-generated essays in a large-scale writing assessment: Is there a bias against non-native English speakers? Computers and Education 217 (2024), 105070. doi:10.1016/j.compedu.2024.105070

  8. [8]

    Enkelejda Kasneci, Kathrin Sessler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, Stephan Krusche, Gitta Kutyniok, Tilman Michaeli, Claudia Nerdel, Jürgen Pf...

  9. [9]

    Klaus Krippendorff. 2011. Computing Krippendorff’s Alpha-Reliability. Departmental Papers (ASC) (2011). University of Pennsylvania

  10. [10]

    Muhammed Ashraf Memon, Gordon Rowland Joughin, and Breda Memon. 2010. Oral assessment and postgraduate medical examinations: establishing conditions for validity, reliability and fairness. Advances in Health Sciences Education 15, 2 (2010), 277–289. doi:10.1007/s10459-008-9111-9

  11. [11]

    Ethan Mollick and Lilach Mollick. 2023. Assigning AI: Seven Approaches for Students, with Prompts. arXiv:2306.10052 [cs.CY] https://arxiv.org/abs/2306.10052

  12. [12]

    Shashi Nallaya, Sheridan Gentili, Scott Weeks, and Katherine Baldock. 2024. The validity, reliability, academic integrity and integration of oral assessments in higher education: A systematic review. Issues in Educational Research 34, 2 (2024), 629–646. http://www.iier.org.au/iier34/nallaya.pdf

  13. [13]

    André Nitze. 2024. Future-proofing Education: A Prototype for Simulating Oral Examinations Using Large Language Models. arXiv:2401.06160 [cs.CY] https://arxiv.org/abs/2401.06160

  14. [14]

    Mark D. Shermis and Jill Burstein (Eds.). 2013. Handbook of Automated Essay Evaluation: Current Applications and New Directions. Routledge, New York

  15. [15]

    Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, and Patrick Lewis. 2024. Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models. arXiv:2404.18796 [cs.CL] https://arxiv.org/abs/2404.18796

  16. [16]

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685 [cs.CL] https://arxiv.org/abs/2306.05685