Rehearsed Multi-Agent Live Product Demonstrations with Real-Time Voice Question Answering

Avinash Karn; Mayank Malhotra; Mouli V; Prakhar Mehrotra; Rahul Khedar

arXiv: 2606.30294 · v1 · pith:DN6OY3BAnew · submitted 2026-06-29 · 💻 cs.AI · cs.HC · cs.SE

Rahul Khedar, Mayank Malhotra, Avinash Karn, Mouli V, Prakhar Mehrotra

Pith reviewed 2026-06-30 06:00 UTC · model grok-4.3

classification 💻 cs.AI · cs.HC · cs.SE

keywords: multi-agent systems, live demonstrations, web applications, source code analysis, narration synchronization, voice question answering, rehearsal loop, UI exploration

The pith

Rhetor is a multi-agent system that converts a running web application and its source code into a rehearsed live demonstration with synchronized narration and real-time voice question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Live product demonstrations require selecting features, performing interactions, narrating them, and answering questions in real time, which is expensive for software teams. The paper presents Rhetor as a system that accepts a running web app and its source-code repository to automate this process. It combines UI exploration with code analysis through a cross-modal representation, uses a constrained scripter with semantic locators, runs a rehearsal loop for convergence, and enforces synchronization between actions and narration audio. Experiments on four applications show high locator success rates after rehearsal, with a proposed benchmark to measure each component.

Core claim

Rhetor takes a running web application and its source-code repository as input and produces a rehearsed live demonstration with segment-synchronized narration and real-time voice question answering, using a cross-modal feature representation, grounded scripter, pre-presentation rehearsal loop, and runtime synchronization invariant.

What carries the argument

The cross-modal feature representation that merges UI exploration with source-code analysis into features tagged with discrete focus tiers, which supports a grounded scripter and rehearsal loop for reliable script convergence.

If this is right

The rehearsal loop with graceful degradation permits demos to continue even when some actions cannot be executed.
The runtime synchronization invariant ensures each browser action aligns exactly with the end of its narration segment.
Locator repair during rehearsal drives convergence to full success on applications like Excalidraw.
A ten-metric benchmark protocol across six application categories can test whether each architectural choice improves outcomes.
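
The rehearsal behaviour in the list above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `rehearse`, `try_action`, `repair`, and the parameters `tau` and `i_max` are hypothetical stand-ins for the threshold τ and iteration cap Imax named in the Figure 2 caption. Only the `NARRATION_ONLY` label comes from the paper, and degrading failures that remain after convergence below 1.0 is an assumption of this sketch.

```python
from typing import Callable, Dict, List

NARRATION_ONLY = "NARRATION_ONLY"  # degradation label named in the Figure 2 caption


def rehearse(
    actions: List[str],
    try_action: Callable[[str], bool],  # hypothetical: True if the locator fires
    repair: Callable[[str], None],      # hypothetical: attempt to fix a failed locator
    tau: float = 0.9,                   # assumed convergence threshold
    i_max: int = 3,                     # assumed iteration cap
) -> Dict[str, str]:
    """Return a status per action: "OK" or NARRATION_ONLY."""
    if not actions:
        return {}
    failed: List[str] = []
    for i in range(i_max):
        # Dry-run every scripted action and collect the ones whose locator fails.
        failed = [a for a in actions if not try_action(a)]
        sigma = 1.0 - len(failed) / len(actions)  # firing rate for this iteration
        if sigma >= tau or i == i_max - 1:
            break  # converged, or out of iterations
        for a in failed:
            repair(a)
    # Residual failures keep their narration but lose their browser action.
    return {a: (NARRATION_ONLY if a in failed else "OK") for a in actions}
```

With a repair function that can fix only some locators, the loop stops at the cap and the unfixable actions come back marked `NARRATION_ONLY`, so the demo can continue without them.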

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Organizations could reduce presenter preparation time by running the system on updated codebases before each demo event.
The same tiered feature approach might support automated testing scripts that remain stable across UI changes.
Real-time voice Q&A could be extended by linking answers directly to the same cross-modal features used for narration.

Load-bearing premise

Merging UI exploration with source-code analysis into tiered features allows the scripter to produce scripts that converge reliably via the rehearsal loop.

What would settle it

If a new application yields a locator-firing rate below 0.5 after multiple rehearsal iterations, the claim of reliable convergence would be falsified.
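
The criterion above reduces to a threshold check on the locator-firing rate. The sketch below assumes the rate is the fraction of scripted actions whose locator fired; the paper's exact definition of sigma-bar is not reproduced on this page, and the function names are hypothetical. The 0.5 threshold is the one stated above.

```python
from typing import Sequence


def locator_firing_rate(fired: Sequence[bool]) -> float:
    """Fraction of scripted actions whose locator fired (assumed definition)."""
    if not fired:
        raise ValueError("no scripted actions to score")
    return sum(fired) / len(fired)


def falsifies_convergence(fired_after_rehearsal: Sequence[bool],
                          threshold: float = 0.5) -> bool:
    """True if the post-rehearsal rate falls strictly below the threshold."""
    return locator_firing_rate(fired_after_rehearsal) < threshold
```

For example, a hypothetical session in which 3 of 10 locators fire after rehearsal would count against the convergence claim, while 5 of 10 would sit exactly at the threshold and would not.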

Figures

Figures reproduced from arXiv: 2606.30294 by Avinash Karn, Mayank Malhotra, Mouli V, Prakhar Mehrotra, Rahul Khedar.

**Figure 1.** The five-phase RHETOR pipeline. Phases 1a and 1b execute concurrently and produce the navigation graph G and code analysis C consumed by Phase 2. Edge labels denote the typed artifact passed between phases; full definitions are in Section 3.

… differs in regime, so end-to-end Phase 1 latency is $T_1 = \max(T_{1a}, T_{1b}) \le T_{1a} + T_{1b}$ (3). 4.1 UI exploration. The crawl is bounded by $|V| \le \kappa_p$ and $\mathrm{depth}(v) \le \kappa_d$ for $\kappa_p$ …
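
The crawl bounds quoted with Figure 1 (at most κp pages, depth at most κd) can be illustrated with a bounded breadth-first search. This is a sketch over an abstract link graph, assuming BFS order; the paper's crawler drives a live application and its traversal strategy is not stated here. `bounded_crawl`, `neighbors`, `kappa_p`, and `kappa_d` are hypothetical names.

```python
from collections import deque
from typing import Callable, Dict, Hashable, Iterable


def bounded_crawl(
    start: Hashable,
    neighbors: Callable[[Hashable], Iterable[Hashable]],  # hypothetical link extractor
    kappa_p: int,  # page cap: |V| <= kappa_p
    kappa_d: int,  # depth cap: depth(v) <= kappa_d
) -> Dict[Hashable, int]:
    """Breadth-first crawl returning {page: depth} within both bounds."""
    if kappa_p < 1:
        return {}
    depth: Dict[Hashable, int] = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth[page] == kappa_d:
            continue  # at the depth cap: record the page but do not expand it
        for nxt in neighbors(page):
            if nxt in depth:
                continue  # already visited
            if len(depth) >= kappa_p:
                return depth  # page cap reached
            depth[nxt] = depth[page] + 1
            queue.append(nxt)
    return depth
```

Both caps are enforced before a page is admitted, so the returned graph never exceeds either bound regardless of how large the application is.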

**Figure 2.** Rehearsal repair loop. Convergence at $\sigma_i \ge \tau$; otherwise, repair and iterate up to $I_{\max}$, then degrade residual failures to NARRATION_ONLY.

… event and clears an internal narration_event. The client begins TTS playback while the server blocks on narration_event.wait(∆max). The client’s audio onended fires a narration_done message that releases the wait, and the server then emits the action event. Provided ∆m…

**Figure 3.** Segment-completion handshake. The server …

**Figure 4.** Live runtime architecture. The target application is served back to the client under the same origin via the …
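
The segment-completion handshake described in the text captured with Figure 2 can be approximated with a `threading.Event`. This is a single-process sketch, assuming a thread in place of the browser client and a list in place of the message channel. `run_segment`, `make_client`, and the `action_after_timeout` label are hypothetical, and what the real system does when the wait times out is not stated in the excerpt.

```python
import threading
import time
from typing import Callable, List


def make_client(narration_event: threading.Event, log: List[str],
                duration: float) -> Callable[[], None]:
    """Build a stand-in client whose narration audio lasts `duration` seconds."""
    def play() -> None:
        time.sleep(duration)          # stands in for TTS playback
        log.append("narration_done")  # stands in for the onended -> narration_done message
        narration_event.set()         # releases the server's wait
    return play


def run_segment(narration_event: threading.Event, play: Callable[[], None],
                log: List[str], delta_max: float) -> bool:
    """Emit narration, block until the audio ends or delta_max elapses, then act."""
    narration_event.clear()           # reset before this segment starts
    log.append("narration")           # server emits the narration event
    client = threading.Thread(target=play)
    client.start()
    finished = narration_event.wait(delta_max)  # True only if set before the timeout
    # Assumption: on timeout this sketch still acts, but records that it did so late.
    log.append("action" if finished else "action_after_timeout")
    client.join()
    return finished
```

When the simulated audio is shorter than `delta_max`, the log order is narration, narration_done, action, which is the ordering the synchronization invariant requires.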

Original abstract

Live product demonstrations are a recurring, high-cost activity in software organizations: a human presenter must select features, dispatch the corresponding interactions on a running application, narrate them coherently, and answer questions in real time. Existing automation addresses only fragments -- generalist browser agents target instruction-conditioned task completion, and demo-video tools produce fixed MP4 artifacts that cannot be questioned and silently break under interface drift. We propose Rhetor, a multi-agent system that takes a running web application and its source-code repository as input and produces a rehearsed live demonstration with segment-synchronized narration and real-time voice question answering. The architectural contributions are a cross-modal feature representation that merges UI exploration with source-code analysis into features tagged with discrete focus tiers, a grounded scripter constrained to UI elements observed during exploration and dispatched through multi-strategy semantic locators, a pre-presentation rehearsal loop with explicit convergence and graceful degradation to narration-only segments, and a runtime synchronization invariant that ties each browser action to the audio-end event of its narration segment. Across six pipeline sessions on four deployed applications -- including the public-domain whiteboard application Excalidraw -- the rehearser's internal locator-firing rate (sigma-bar) spans 0.31-1.00 over 147 scripted actions; on the substantial workload (53 actions, full tier differentiation), sigma-bar is approximately 0.92, and on the public-domain reference point the locator-repair step drives convergence to sigma-bar = 1.00 at iteration 2. We additionally define a benchmark protocol of ten metrics across six application categories that would establish, beyond the case study, whether each design choice contributes positively.
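
The abstract's "multi-strategy semantic locators" suggest a fallback chain in which each action tries several ways of finding its target. The sketch below runs over a toy in-memory element list rather than a real browser. The strategies shown (role and name, test id, visible text) and all names are assumptions for illustration, since the paper's actual strategy set is not listed on this page.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Element:
    role: str
    name: str
    text: str = ""
    test_id: str = ""


Target = Dict[str, str]
Strategy = Callable[[Sequence[Element], Target], Optional[Element]]


def by_role_and_name(dom: Sequence[Element], target: Target) -> Optional[Element]:
    return next((e for e in dom
                 if e.role == target.get("role") and e.name == target.get("name")), None)


def by_test_id(dom: Sequence[Element], target: Target) -> Optional[Element]:
    tid = target.get("test_id")
    return next((e for e in dom if tid and e.test_id == tid), None)


def by_text(dom: Sequence[Element], target: Target) -> Optional[Element]:
    text = target.get("text")
    return next((e for e in dom if text and text.lower() in e.text.lower()), None)


# Ordered from most to least specific (an assumed ordering).
STRATEGIES: List[Tuple[str, Strategy]] = [
    ("role+name", by_role_and_name),
    ("test_id", by_test_id),
    ("text", by_text),
]


def locate(dom: Sequence[Element], target: Target) -> Optional[Tuple[str, Element]]:
    """Try each strategy in order; return the first hit with the strategy's label."""
    for label, strategy in STRATEGIES:
        found = strategy(dom, target)
        if found is not None:
            return label, found
    return None  # no strategy fired; a rehearsal step could repair or degrade here
```

Returning the label of the strategy that fired makes it possible to see which actions survive only through a fallback, which is the kind of signal a rehearsal pass could use when deciding what to repair.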

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Rhetor assembles a multi-agent pipeline for rehearsed live demos with voice QA but rests on case-study metrics without executing its own proposed benchmark.

The core of this paper is a system called Rhetor that chains UI exploration, source-code analysis, a rehearsal loop with explicit convergence, and a runtime sync rule to produce live demos that can handle real-time voice questions. The combination of those pieces into one pipeline for this specific task is what stands out as new compared to the general browser agents or fixed video tools mentioned.

It does a few things cleanly. The authors report locator-firing rates (sigma-bar) from actual runs on four applications, reaching about 0.92 on a 53-action workload and full convergence to 1.00 after two rehearsal iterations on Excalidraw. They also spell out a ten-metric benchmark across six application categories that would let future work test each component. That protocol is a concrete step even if it was not run here.

The main limitation is that all numbers come from six sessions on four apps with no error bars, no statistical tests, and no ablations. The paper itself notes that the benchmark would be needed to show whether the cross-modal features or the rehearsal loop actually drive the results, but that step was skipped. The scope is also narrow: it is limited to web apps that accept the kinds of inputs the system can locate.

This is worth a look for anyone building multi-agent tools that need to stay grounded in both UI state and code. A reader who wants ideas for rehearsal loops or cross-modal tagging could pull useful details. It is not yet strong enough to change how demos are done in practice.

I would send it to peer review. The architecture is described clearly enough and the case results give a starting point, but referees would need to press on the missing benchmark and the lack of controlled comparisons.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Rhetor, a multi-agent system that takes a running web application and its source-code repository as input and outputs a rehearsed live demonstration featuring segment-synchronized narration and real-time voice question answering. Architectural elements include a cross-modal feature representation merging UI exploration with source-code analysis via discrete focus tiers, a grounded scripter constrained to observed UI elements and using multi-strategy semantic locators, a pre-presentation rehearsal loop with explicit convergence criteria and graceful degradation, and a runtime synchronization invariant linking browser actions to narration audio-end events. Empirical results are reported from six pipeline sessions on four deployed applications (including Excalidraw), with internal locator-firing rates (sigma-bar) spanning 0.31-1.00 over 147 actions; on a 53-action workload sigma-bar is ~0.92 and on Excalidraw the repair step yields convergence to 1.00 at iteration 2. The paper also defines (but does not execute) a benchmark protocol of ten metrics across six application categories.

Significance. If the cross-modal representation and rehearsal loop reliably produce convergent scripts, the work could meaningfully reduce the recurring human cost of live product demonstrations in software organizations. The case-study metrics provide concrete evidence of feasibility on real deployed applications, including a public-domain reference, and the explicit definition of a benchmark protocol is a constructive contribution toward falsifiable evaluation. However, because the reported results are limited to six sessions without ablations, comparisons, or execution of the proposed benchmark, the demonstrated impact remains preliminary rather than conclusive.

major comments (2)

[Abstract] The central claim that the cross-modal feature representation (with discrete focus tiers) enables a grounded scripter whose scripts converge reliably via the rehearsal loop rests on sigma-bar values from six sessions on four applications. The manuscript itself states that the benchmark protocol of ten metrics across six categories 'would establish, beyond the case study, whether each design choice contributes positively,' indicating this protocol was not executed. This is load-bearing because the reliability claim cannot be assessed without systematic ablations or the defined benchmark.
[Abstract] The reported performance figures (sigma-bar of 0.92 on the 53-action workload; convergence to 1.00 at iteration 2 on Excalidraw) are presented without error bars, statistical tests, or a full experimental protocol. This weakens the empirical grounding for the claim that the rehearsal loop drives reliable convergence across the 147 scripted actions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The manuscript presents the reported results explicitly as a case study on four deployed applications while defining (but not executing) a benchmark protocol for stronger claims. We address each major comment below and propose targeted revisions to improve clarity.

Referee: [Abstract] The central claim that the cross-modal feature representation (with discrete focus tiers) enables a grounded scripter whose scripts converge reliably via the rehearsal loop rests on sigma-bar values from six sessions on four applications. The manuscript itself states that the benchmark protocol of ten metrics across six categories 'would establish, beyond the case study, whether each design choice contributes positively,' indicating this protocol was not executed. This is load-bearing because the reliability claim cannot be assessed without systematic ablations or the defined benchmark.

Authors: We agree that the benchmark protocol was defined but not executed, as the manuscript already states. The central claim is scoped to the observed behavior in the six pipeline sessions: the cross-modal features, grounded scripter, and rehearsal loop produced sigma-bar values of 0.31-1.00 (with 0.92 on the 53-action workload and convergence to 1.00 at iteration 2 on Excalidraw). These are internal, per-session metrics of locator success during rehearsal, not a general reliability guarantee. We will revise the abstract to more explicitly frame the contribution as a feasibility case study on real applications (including a public-domain reference) and to note that the defined benchmark remains future work for systematic ablations.

Revision: partial

Referee: [Abstract] The reported performance figures (sigma-bar of 0.92 on the 53-action workload; convergence to 1.00 at iteration 2 on Excalidraw) are presented without error bars, statistical tests, or a full experimental protocol. This weakens the empirical grounding for the claim that the rehearsal loop drives reliable convergence across the 147 scripted actions.

Authors: The figures are descriptive metrics computed directly from the six individual pipeline sessions on specific applications and workloads (147 actions total). Because each session constitutes a complete run rather than a sample from a larger population, standard error bars or inferential statistical tests are not applicable. We will add a clarifying sentence in the abstract and results section stating that these values are case-study observations of the rehearsal process rather than aggregated experimental statistics.

Revision: yes

Circularity Check

0 steps flagged

No circularity; case-study metrics are direct observations, not self-defined

The paper describes an architectural system (cross-modal features, grounded scripter, rehearsal loop) and reports sigma-bar locator-firing rates as direct measurements from six sessions on four applications. No equations, fitted parameters, or predictions appear that reduce by construction to inputs. The text explicitly distinguishes the presented case studies from an unexecuted benchmark protocol of ten metrics, so the reliability claims rest on observed runs rather than tautological re-labeling. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The derivation chain is therefore self-contained engineering description plus empirical reporting.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, mathematical axioms, or ungrounded invented entities beyond named system components are identifiable.

pith-pipeline@v0.9.1-grok · 5854 in / 1308 out tokens · 50696 ms · 2026-06-30T06:00:38.521004+00:00 · methodology

Reference graph

Works this paper leans on

13 extracted references · 6 canonical work pages · 3 internal anchors

[1] X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su. Mind2Web: Towards a generalist agent for the web. In NeurIPS Datasets and Benchmarks,

[2] S. Zhou et al. WebArena: A realistic web environment for building autonomous agents. arXiv:2307.13854, 2024.

[3] J. Y. Koh et al. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. arXiv:2401.13649, 2024.

[4] H. He et al. WebVoyager: Building an end-to-end web agent with large multimodal models. arXiv:2401.13919, 2024.

[5] T. Xue et al. An illusion of progress? Assessing the current state of web agents. arXiv:2504.01382, 2025.

[6] Allen Institute for AI. MolmoWeb: An open agent for automating web tasks. Technical report, 2026.

[7] A. Vardanyan. Building browser agents: Architecture, security, and practical solutions. arXiv:2511.19477, 2025.

[8] R. Khedar et al. State-grounded multi-agent synthetic data generation for tool-augmented LLMs. arXiv:2606.16307, 2026.

[9] OpenAI. Realtime API: speech-to-speech over WebSocket. Technical documentation, 2024.

[10] Y. Madkour. DemoPilot: autonomous demo video agent. GitHub, 2026.

[11] N. Holas. LooK: one-command product demo videos. GitHub, 2026.

[12] NeuraScreen: JSON-driven demo video generator. GitHub, 2026.

[13] F. Mathieu. DemoDSL. GitHub, 2026.
