Real-Time Voicemail Detection in Telephony Audio Using Temporal Speech Activity Features

Kumar Saurav

arXiv: 2604.09675 · v1 · submitted 2026-04-02 · 💻 cs.SD · cs.AI · cs.LG

Pith reviewed 2026-05-13 20:22 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · cs.LG

keywords voicemail detection · voice activity detection · telephony audio · real-time classification · speech activity features · tree ensemble · outbound calling

The pith

Temporal speech activity patterns from a pre-trained voice activity detector distinguish voicemail greetings from live human answers at 96.1 percent accuracy using a lightweight classifier.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a simple set of 15 timing-based features extracted from the output of a pre-trained neural voice activity detector can separate voicemail greetings from live human speech in telephony audio. A shallow tree ensemble trained on those features reaches 96.1 percent accuracy across 764 recordings and holds the false-positive rate at 0.3 percent and the false-negative rate at 1.3 percent on more than 77,000 production calls. The entire pipeline finishes in 46 milliseconds on a commodity dual-core CPU and needs no transcription or beep detection. Feature importance analysis reveals that most of the decision power comes from just three of the timing variables. The approach therefore offers a practical, low-latency solution for real-time outbound calling systems that must decide whether to leave a message or connect an agent.

Core claim

Temporal patterns in speech activity, captured by 15 features from a pre-trained neural VAD, are sufficient to classify voicemail versus live human answers with a shallow tree-based ensemble, yielding 96.1 percent accuracy across two telephony evaluation sets and false-positive and false-negative rates of 0.3 and 1.3 percent at production scale, without requiring speech transcription or beep detection.

What carries the argument

Fifteen temporal speech-activity features extracted from a pre-trained VAD, fed to a shallow tree-based ensemble classifier.

If this is right

Outbound AI calling platforms can make the connect-or-message decision in under 50 ms without GPU resources or full transcription.
Production systems can maintain false-positive rates near 0.3 percent while handling hundreds of concurrent calls on commodity dual-core CPUs.
Only three of the fifteen timing features carry most of the classification power, allowing further simplification of the model.
Adding keyword transcription or beep detection increases latency without improving the best real-time configuration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same timing features could be reused for other short-duration audio classification tasks where full content analysis is too slow.
The method may reduce reliance on heavier models in any telephony pipeline that already runs a VAD for other purposes.
Deployment cost drops because the classifier runs on existing CPU infrastructure and scales to thousands of simultaneous streams.

Load-bearing premise

The temporal features derived from the chosen pre-trained VAD remain informative for the voicemail-versus-live distinction on new telephony traffic outside the two evaluation sets and the observed production calls.

What would settle it

Accuracy falling below 90 percent when the same feature set and model are tested on a fresh collection of several hundred telephony recordings drawn from a different carrier or demographic.
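
One way to make this test concrete is to put a confidence interval on an observed accuracy. The sketch below computes a Wilson score interval, which is a standard statistical choice and is not taken from the paper; the function name `wilson_interval` and the 95% level (z = 1.96) are assumptions for illustration. It is applied to the held-out production result reported in the abstract (595 of 624 correct).

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = correct / total
    z2 = z * z
    denom = 1.0 + z2 / total
    center = (p_hat + z2 / (2 * total)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / total + z2 / (4 * total * total)) / denom
    return center - half, center + half

# Held-out production set from the abstract: 595 correct out of 624.
low, high = wilson_interval(595, 624)
print(f"95% Wilson interval: [{low:.3f}, {high:.3f}]")
```

Under this reading, a fresh sample of similar size whose interval lies entirely below 0.90 would count against the premise, while an interval that straddles 0.90 would leave the question open.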

Figures

Figures reproduced from arXiv: 2604.09675 by Kumar Saurav.

Original abstract

Outbound AI calling systems must distinguish voicemail greetings from live human answers in real time to avoid wasted agent interactions and dropped calls. We present a lightweight approach that extracts 15 temporal features from the speech activity pattern of a pre-trained neural voice activity detector (VAD), then classifies with a shallow tree-based ensemble. Across two evaluation sets totaling 764 telephony recordings, the system achieves a combined 96.1% accuracy (734/764), with 99.3% (139/140) on an expert-labeled test set and 95.4% (595/624) on a held-out production set. In production validation over 77,000 calls, it maintained a 0.3% false positive rate and 1.3% false negative rate. End-to-end inference completes in 46 ms on a commodity dual-core CPU with no GPU, supporting 380+ concurrent WebSocket calls. In our search over 3,780 model, feature, and threshold combinations, feature importance was concentrated in three temporal variables. Adding transcription keywords or beep-based features did not improve the best real-time configuration and increased latency substantially. Our results suggest that temporal speech patterns are a strong signal for distinguishing voicemail greetings from live human answers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

A practical engineering note on voicemail detection via temporal VAD features that reports strong production numbers but leaves the large-scale labeling process undescribed.

The paper shows how to catch voicemail greetings in outbound calls by feeding 15 temporal speech-activity statistics from an off-the-shelf VAD into a shallow tree ensemble. The main result is that this setup reaches 96.1% accuracy on 764 evaluation recordings and keeps false-positive and false-negative rates at 0.3% and 1.3% across 77,000 production calls, all while running in 46 ms on a commodity CPU. Keyword and beep features added nothing useful and only increased latency, so the temporal pattern alone carried the signal. After searching 3,780 combinations, the author found that three of the temporal variables did most of the work. That is the concrete, usable piece: a lightweight real-time filter that avoids wasting agent time on machines. The production error rates are the part that would matter to anyone shipping telephony AI today.

The soft spot is the 77k-call validation. The paper does not say how those labels were obtained, whether by human review, downstream outcome, or some proxy. Without that detail it is hard to judge how much label noise or distribution shift the numbers actually survived. The large search over configurations also leaves open the usual risk that thresholds were fitted to the same traffic the model later saw in production.

The work is narrow and applied rather than foundational, but the measurements are specific enough that an editor could reasonably send it out for review in a systems or applications track. A referee could ask for the labeling protocol and a clearer description of the 15 features; those fixes would make the result more reusable. I would bring the production numbers to a reading group to see whether the same temporal approach transfers to other telephony tasks, but I would not cite it in core audio research. It deserves referee time because the latency and error-rate claims are the kind of evidence practitioners need, even if the methods section needs tightening.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a lightweight real-time system for distinguishing voicemail greetings from live human answers in telephony audio. It extracts 15 temporal features from the speech-activity pattern of a pre-trained neural VAD and classifies them with a shallow tree-based ensemble. Performance is reported as 99.3% accuracy on an expert-labeled set of 140 recordings, 95.4% on a held-out production set of 624 recordings (combined 96.1% on 764 total), and 0.3% FP / 1.3% FN rates on 77,000 production calls. End-to-end inference runs in 46 ms on commodity CPU hardware. A search over 3,780 model/feature/threshold combinations showed feature importance concentrated in three temporal variables; adding transcription keywords or beep detectors did not improve the best real-time configuration.

Significance. If the performance claims hold under independent scrutiny, the work supplies a practical, low-latency, GPU-free solution for outbound AI telephony systems that avoids transcription overhead. The emphasis on purely temporal speech-activity patterns from an off-the-shelf VAD is a pragmatic engineering choice that could reduce wasted agent interactions and dropped calls. The production-scale validation on 77k calls is a positive aspect, though its interpretability hinges on labeling quality. The result is potentially useful for real-world deployment but remains an empirical engineering contribution rather than a fundamental methodological advance.

major comments (2)

[Methods] Feature Extraction: the 15 temporal speech-activity features are never explicitly defined (e.g., no equations or pseudocode for duration, count, or ratio statistics), and the exact search procedure over the 3,780 combinations is not described. Without these details the claim that importance is concentrated in three variables cannot be reproduced or stress-tested.
[Production Validation] The labeling methodology used to obtain ground truth for the 77,000-call trace is not specified. Because the reported 0.3% FP and 1.3% FN rates rest entirely on this un-auditable labeling, the production results cannot be assessed for label noise, selection bias, or correlation with the chosen features.

minor comments (1)

[Abstract] The combined 96.1% figure (734/764) should clarify whether it is a simple pooled accuracy or a weighted average; the two constituent sets have markedly different sizes and accuracies.
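
The distinction the referee raises can be checked directly from the counts in the abstract. The short sketch below (variable names are illustrative) compares the pooled accuracy with the unweighted mean of the two per-set accuracies.

```python
# Counts reported in the abstract: (correct, total) per evaluation set.
sets = {
    "expert_labeled": (139, 140),
    "held_out_production": (595, 624),
}

per_set = {name: c / n for name, (c, n) in sets.items()}
pooled = sum(c for c, _ in sets.values()) / sum(n for _, n in sets.values())
unweighted_mean = sum(per_set.values()) / len(per_set)

for name, acc in per_set.items():
    print(f"{name}: {acc:.1%}")
print(f"pooled: {pooled:.1%}")
print(f"unweighted mean of the two sets: {unweighted_mean:.1%}")
```

The pooled value reproduces 734/764, about 96.1%, so the abstract's combined figure is a pooled accuracy (equivalently, an average weighted by set size). An unweighted mean of the two sets would be about 97.3%.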

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We have revised the manuscript to improve methodological clarity and address concerns about reproducibility and validation details.

Referee: [Methods] Feature Extraction: the 15 temporal speech-activity features are never explicitly defined (e.g., no equations or pseudocode for duration, count, or ratio statistics), and the exact search procedure over the 3,780 combinations is not described. Without these details the claim that importance is concentrated in three variables cannot be reproduced or stress-tested.

Authors: We agree that the original manuscript lacked sufficient detail for reproducibility. The revised version adds a dedicated subsection in Methods that explicitly defines all 15 temporal features with equations (e.g., total speech duration as the sum of VAD-positive segments, speech-segment count, mean/max silence intervals, speech-to-silence ratio, and onset/offset statistics). We also describe the exhaustive enumeration of the 3,780 combinations, including the hyperparameter grid, feature-subset enumeration, and threshold sweep, along with the cross-validation protocol used to identify the three dominant variables. These changes directly enable independent verification.

Revision: yes

Referee: [Production Validation] The labeling methodology used to obtain ground truth for the 77,000-call trace is not specified. Because the reported 0.3% FP and 1.3% FN rates rest entirely on this un-auditable labeling, the production results cannot be assessed for label noise, selection bias, or correlation with the chosen features.

Authors: We acknowledge that the production labeling process was insufficiently described. The revised manuscript now states that labels were obtained via post-call operational logs combined with expert review of call outcomes (live agent connection versus voicemail greeting). While full internal procedures remain proprietary and cannot be disclosed in detail, this high-level clarification allows assessment of the reported rates in context. The core accuracy claims (96.1% on the 764 evaluation recordings, of which 140 are expert-labeled) do not rely on the production labels.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical ML pipeline with held-out test sets

The paper presents a standard supervised classification pipeline: 15 temporal features are extracted from a pre-trained VAD, a tree ensemble is trained, and accuracy/FPR/FNR are measured on two explicitly held-out collections (expert-labeled n=140 and production n=624) plus a separate 77k-call production trace. No equations, self-definitional loops, fitted-input predictions, or load-bearing self-citations appear; the 3,780-configuration search is ordinary hyperparameter tuning whose final metrics are evaluated on data not used for that search. The derivation chain therefore does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that a pre-trained VAD produces reliable speech-activity patterns and that the 15 temporal statistics derived from them separate voicemail from live speech in the tested domains. No new physical entities or mathematical axioms are introduced.

free parameters (1)

feature selection and thresholds
The paper searched 3,780 model-feature-threshold combinations, implying multiple fitted or chosen hyperparameters that define the final classifier.
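
To illustrate what a search over model, feature, and threshold combinations looks like, the sketch below enumerates a small grid. Every value in it is a hypothetical placeholder: the model names, the threshold values, and the use of only three feature names are assumptions, and the resulting grid is deliberately much smaller than the paper's 3,780 configurations, whose actual breakdown is not given here.

```python
from itertools import combinations, product

# All names and values below are hypothetical placeholders, not the paper's grid.
MODELS = ["random_forest", "gradient_boosting"]
FEATURES = ["speech_ratio", "num_segments", "mean_seg_ms"]
THRESHOLDS = [0.3, 0.5, 0.7]

def feature_subsets(features):
    """Yield every non-empty subset of the feature names."""
    for k in range(1, len(features) + 1):
        yield from combinations(features, k)

def enumerate_configs():
    """Yield one (model, feature subset, threshold) tuple per configuration."""
    yield from product(MODELS, feature_subsets(FEATURES), THRESHOLDS)

configs = list(enumerate_configs())
print(len(configs))  # 2 models * 7 subsets * 3 thresholds = 42
```

Each tuple would then be trained and scored under a cross-validation protocol; because every configuration is a fitted choice, the final metrics are only meaningful on data that played no part in the search.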

axioms (1)

domain assumption: A pre-trained neural VAD accurately segments speech activity in telephony audio
The entire feature pipeline begins with the output of this external VAD model.
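
As a rough illustration of that first step, the sketch below turns per-frame speech probabilities into speech segments by thresholding. It assumes the VAD emits one probability per fixed-length frame; the function name `frames_to_segments`, the 32 ms frame length, and the 0.5 threshold are all assumptions for illustration and are not taken from the paper.

```python
def frames_to_segments(probs, frame_ms=32, threshold=0.5):
    """Convert per-frame speech probabilities into (start_ms, end_ms) speech segments.

    frame_ms and threshold are illustrative assumptions, not values from the paper.
    """
    segments = []
    start = None
    for i, p in enumerate(probs):
        is_speech = p >= threshold
        if is_speech and start is None:
            start = i * frame_ms                    # a speech run begins at this frame
        elif not is_speech and start is not None:
            segments.append((start, i * frame_ms))  # the run ended before this frame
            start = None
    if start is not None:                           # close a run that reaches the end
        segments.append((start, len(probs) * frame_ms))
    return segments

probs = [0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.9, 0.95]
print(frames_to_segments(probs))  # [(32, 96), (160, 256)]
```

Any error the VAD makes at this stage propagates into every downstream temporal feature, which is why the assumption is listed as load-bearing.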

pith-pipeline@v0.9.0 · 5516 in / 1313 out tokens · 41356 ms · 2026-05-13T20:22:48.788905+00:00 · methodology

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] Twilio, “Answering Machine Detection,” Twilio Documentation, 2024.
[2] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in Proc. IEEE ICASSP, 2017, pp. 776–780.
[3] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, 2020.
[4] P. Maragos, J. F. Kaiser, and T. F. Quatieri, “Energy separation in signal modulations with application to speech analysis,” IEEE Trans. Signal Processing, vol. 41, no. 10, pp. 3024–3051, 1993.
[5] Silero Team, “Silero VAD: pre-trained enterprise-grade Voice Activity Detector,” 2021.
[6] Microsoft, “ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator,” 2021.
[7] Google, “WebRTC Voice Activity Detector,” WebRTC Project, 2011.
[8] J. Ramírez, J. M. Górriz, and J. C. Segura, “Voice Activity Detection. Fundamentals and Speech Recognition System Robustness,” in Robust Speech Recognition and Understanding, 2007.
[9] J. Sohn, N. S. Kim, and W. Sung, “A statistical model-based voice activity detection,” IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1–3, 1999.

A Feature Definitions

Given a detection window of $W$ milliseconds and $N$ speech segments $\{(s_i, e_i)\}_{i=1}^{N}$ with durations $d_i = e_i - s_i$:

$$\text{speech\_ratio} = \frac{\sum_{i=1}^{N} d_i}{W} \tag{1}$$

$$\text{num\_segments} = N \tag{2}$$

$$\text{mean\_seg\_ms} = \frac{1}{N} \sum_{i=1}^{N} d_i$$ (0 if N...
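
The three definitions that survive in the extracted appendix text (speech_ratio, num_segments, mean_seg_ms) can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `temporal_features` and the 4,000 ms window in the example are assumptions, and returning 0 for mean_seg_ms when there are no segments is a guess at the truncated condition in the source.

```python
def temporal_features(segments, window_ms):
    """Compute speech_ratio, num_segments and mean_seg_ms from speech segments.

    segments: list of (start_ms, end_ms) pairs inside the detection window.
    Returning 0 for mean_seg_ms when there are no segments is an assumption;
    the source text is truncated at that point.
    """
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    durations = [end - start for start, end in segments]
    n = len(durations)
    return {
        "speech_ratio": sum(durations) / window_ms,       # Eq. (1)
        "num_segments": n,                                # Eq. (2)
        "mean_seg_ms": sum(durations) / n if n else 0.0,  # third definition
    }

# Hypothetical example: two speech segments inside a 4,000 ms window.
features = temporal_features([(200, 1400), (1900, 2500)], window_ms=4000)
print(features)
```

The remaining twelve features are not recoverable from the text available here, so the sketch stops at these three.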
[10] A cross-channel energy model (8 features from both bot and callee channels) was trained on 91 hand-labeled files, achieving 97.9% accuracy via leave-one-out cross-validation.
[11] This model predicted labels for all 25,887 production recordings.
[12] Predictions were stratified into confidence tiers:
• STRONG_VM: p(VM) > 0.90 (11,830 files, 81.6% verified accuracy)
• STRONG_NVM: p(VM) < 0.10 (12,514 files, 98.2% verified accuracy)
• MODERATE: 0.10 ≤ p ≤ 0.90 (180 files, excluded)
• UNCERTAIN: Conflicting signals (1,067 files, excluded)
[13] Only STRONG_VM and STRONG_NVM files were used for training (24,812 total). This pseudo-labeling approach achieves a weighted accuracy of ∼90% on the training set, which is sufficient for training a model that generalizes to 99.3% on independently hand-labeled data.
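
The tiering rule and the weighted-accuracy figure quoted in entries [12] and [13] can be reproduced with a few lines. The sketch below is illustrative only: the function name `confidence_tier` is an assumption, and the UNCERTAIN tier is left out because it depends on "conflicting signals" that are not defined in the text available here.

```python
def confidence_tier(p_vm: float) -> str:
    """Map a voicemail probability to the tiers quoted above.

    The UNCERTAIN tier (conflicting signals) needs information beyond p(VM)
    and is deliberately not modeled here.
    """
    if p_vm > 0.90:
        return "STRONG_VM"
    if p_vm < 0.10:
        return "STRONG_NVM"
    return "MODERATE"

# (file count, verified accuracy) for the two tiers kept for training.
kept = {"STRONG_VM": (11_830, 0.816), "STRONG_NVM": (12_514, 0.982)}

total_files = sum(n for n, _ in kept.values())
weighted_acc = sum(n * acc for n, acc in kept.values()) / total_files
print(f"weighted accuracy over kept tiers: {weighted_acc:.1%}")
```

The result is about 90.1%, consistent with the quoted ∼90%. The two tier counts as quoted sum to 24,344 rather than the 24,812 stated for the training set; the sketch uses the per-tier counts as given.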
