Real-Time Voicemail Detection in Telephony Audio Using Temporal Speech Activity Features
Pith reviewed 2026-05-13 20:22 UTC · model grok-4.3
The pith
Temporal speech activity patterns from a pre-trained voice activity detector distinguish voicemail greetings from live human answers at 96.1 percent accuracy using a lightweight classifier.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Temporal patterns in speech activity, captured by 15 features from a pre-trained neural VAD, are sufficient to classify voicemail versus live human answers with a shallow tree-based ensemble, yielding 96.1 percent accuracy on held-out telephony data and sub-2 percent error rates at production scale without requiring speech transcription or beep detection.
What carries the argument
Fifteen temporal speech-activity features extracted from a pre-trained VAD, fed to a shallow tree-based ensemble classifier.
If this is right
- Outbound AI calling platforms can make the connect-or-message decision in under 50 ms without GPU resources or full transcription.
- Production systems can maintain false-positive rates near 0.3 percent while handling hundreds of concurrent calls on commodity dual-core CPUs.
- Only three of the fifteen timing features carry most of the classification power, allowing further simplification of the model.
- Adding keyword transcription or beep detection increases latency without improving the best real-time configuration.
Where Pith is reading between the lines
- The same timing features could be reused for other short-duration audio classification tasks where full content analysis is too slow.
- The method may reduce reliance on heavier models in any telephony pipeline that already runs a VAD for other purposes.
- Deployment cost drops because the classifier runs on existing CPU infrastructure and scales to thousands of simultaneous streams.
Load-bearing premise
The temporal features derived from the chosen pre-trained VAD remain informative for the voicemail-versus-live distinction on new telephony traffic outside the two evaluation sets and the observed production calls.
What would settle it
Accuracy falling below 90 percent when the same feature set and model are tested on a fresh collection of several hundred telephony recordings drawn from a different carrier or demographic.
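A test against a 90 percent threshold on a few hundred recordings carries sampling uncertainty of several points, which matters when reading the outcome. The sketch below computes a Wilson score interval for an observed accuracy. The 270-of-300 result is a hypothetical illustration, not a figure from the paper, and the choice of the Wilson interval is an assumption of this sketch.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half

# Hypothetical replication: 270 correct out of 300 fresh recordings (90% observed).
low, high = wilson_interval(270, 300)
print(f"observed 90.0%, interval {low:.1%} to {high:.1%}")
```

At this sample size the interval spans roughly seven percentage points, so an observed accuracy slightly under 90 percent would not by itself be decisive.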
Original abstract
Outbound AI calling systems must distinguish voicemail greetings from live human answers in real time to avoid wasted agent interactions and dropped calls. We present a lightweight approach that extracts 15 temporal features from the speech activity pattern of a pre-trained neural voice activity detector (VAD), then classifies with a shallow tree-based ensemble. Across two evaluation sets totaling 764 telephony recordings, the system achieves a combined 96.1% accuracy (734/764), with 99.3% (139/140) on an expert-labeled test set and 95.4% (595/624) on a held-out production set. In production validation over 77,000 calls, it maintained a 0.3% false positive rate and 1.3% false negative rate. End-to-end inference completes in 46 ms on a commodity dual-core CPU with no GPU, supporting 380+ concurrent WebSocket calls. In our search over 3,780 model, feature, and threshold combinations, feature importance was concentrated in three temporal variables. Adding transcription keywords or beep-based features did not improve the best real-time configuration and increased latency substantially. Our results suggest that temporal speech patterns are a strong signal for distinguishing voicemail greetings from live human answers.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a lightweight real-time system for distinguishing voicemail greetings from live human answers in telephony audio. It extracts 15 temporal features from the speech-activity pattern of a pre-trained neural VAD and classifies them with a shallow tree-based ensemble. Performance is reported as 99.3% accuracy on an expert-labeled set of 140 recordings, 95.4% on a held-out production set of 624 recordings (combined 96.1% on 764 total), and 0.3% FP / 1.3% FN rates on 77,000 production calls. End-to-end inference runs in 46 ms on commodity CPU hardware. A search over 3,780 model/feature/threshold combinations showed feature importance concentrated in three temporal variables; adding transcription keywords or beep detectors did not improve the best real-time configuration.
Significance. If the performance claims hold under independent scrutiny, the work supplies a practical, low-latency, GPU-free solution for outbound AI telephony systems that avoids transcription overhead. The emphasis on purely temporal speech-activity patterns from an off-the-shelf VAD is a pragmatic engineering choice that could reduce wasted agent interactions and dropped calls. The production-scale validation on 77k calls is a positive aspect, though its interpretability hinges on labeling quality. The result is potentially useful for real-world deployment but remains an empirical engineering contribution rather than a fundamental methodological advance.
major comments (2)
- [Methods] Methods / Feature Extraction: the 15 temporal speech-activity features are never explicitly defined (e.g., no equations or pseudocode for duration, count, or ratio statistics), and the exact search procedure over the 3,780 combinations is not described. Without these details the claim that importance is concentrated in three variables cannot be reproduced or stress-tested.
- [Production Validation] Production Validation: the labeling methodology used to obtain ground truth for the 77,000-call trace is not specified. Because the reported 0.3% FP and 1.3% FN rates rest entirely on this un-auditable labeling, the production results cannot be assessed for label noise, selection bias, or correlation with the chosen features.
minor comments (1)
- [Abstract] Abstract: the combined 96.1% figure (734/764) should clarify whether it is a simple pooled accuracy or a weighted average; the two constituent sets have markedly different sizes and accuracies.
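The distinction raised in this comment can be checked directly from the counts in the abstract. The sketch below compares the pooled accuracy (total correct over total recordings) with an unweighted mean of the two per-set accuracies. The function and variable names are illustrative only.

```python
# (correct, total) per evaluation set, as reported in the abstract.
sets = {
    "expert_labeled": (139, 140),
    "held_out_production": (595, 624),
}

def pooled_accuracy(counts):
    """Total correct over total recordings, i.e. a size-weighted average."""
    correct = sum(c for c, _ in counts.values())
    total = sum(n for _, n in counts.values())
    return correct / total

def macro_accuracy(counts):
    """Unweighted mean of the per-set accuracies."""
    return sum(c / n for c, n in counts.values()) / len(counts)

pooled = pooled_accuracy(sets)
macro = macro_accuracy(sets)
print(f"pooled: {pooled:.1%}, unweighted mean: {macro:.1%}")
```

The pooled value reproduces the reported 96.1% (734/764), which suggests the abstract's figure is a simple pooled accuracy dominated by the larger production set.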
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We have revised the manuscript to improve methodological clarity and address concerns about reproducibility and validation details.
- Referee: [Methods] Methods / Feature Extraction: the 15 temporal speech-activity features are never explicitly defined (e.g., no equations or pseudocode for duration, count, or ratio statistics), and the exact search procedure over the 3,780 combinations is not described. Without these details the claim that importance is concentrated in three variables cannot be reproduced or stress-tested.
  Authors: We agree that the original manuscript lacked sufficient detail for reproducibility. The revised version adds a dedicated subsection in Methods that explicitly defines all 15 temporal features with equations (e.g., total speech duration as the sum of VAD-positive segments, speech-segment count, mean/max silence intervals, speech-to-silence ratio, and onset/offset statistics). We also describe the exhaustive enumeration of the 3,780 combinations, including the hyperparameter grid, feature-subset enumeration, and threshold sweep, along with the cross-validation protocol used to identify the three dominant variables. These changes directly enable independent verification.
  Revision: yes
- Referee: [Production Validation] Production Validation: the labeling methodology used to obtain ground truth for the 77,000-call trace is not specified. Because the reported 0.3% FP and 1.3% FN rates rest entirely on this un-auditable labeling, the production results cannot be assessed for label noise, selection bias, or correlation with the chosen features.
  Authors: We acknowledge that the production labeling process was insufficiently described. The revised manuscript now states that labels were obtained via post-call operational logs combined with expert review of call outcomes (live agent connection versus voicemail greeting). While full internal procedures remain proprietary and cannot be disclosed in detail, this high-level clarification allows assessment of the reported rates in context. The core accuracy claims (96.1% on the 764 evaluation recordings, of which 140 are expert-labeled) do not rely on the production labels.
  Revision: partial
Circularity Check
No circularity: empirical ML pipeline with held-out test sets
The paper presents a standard supervised classification pipeline: 15 temporal features are extracted from a pre-trained VAD, a tree ensemble is trained, and accuracy/FPR/FNR are measured on two explicitly held-out collections (expert-labeled n=140 and production n=624) plus a separate 77k-call production trace. No equations, self-definitional loops, fitted-input predictions, or load-bearing self-citations appear; the 3,780-configuration search is ordinary hyperparameter tuning whose final metrics are evaluated on data not used for that search. The derivation chain therefore does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- feature selection and thresholds
axioms (1)
- domain assumption: A pre-trained neural VAD accurately segments speech activity in telephony audio.
Reference graph
Works this paper leans on
[1] Twilio, “Answering Machine Detection,” Twilio Documentation, 2024.
[2] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio Set: An ontology and human-labeled dataset for audio events,” in Proc. IEEE ICASSP, 2017, pp. 776–780.
[3] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, 2020.
[4] P. Maragos, J. F. Kaiser, and T. F. Quatieri, “Energy separation in signal modulations with application to speech analysis,” IEEE Trans. Signal Processing, vol. 41, no. 10, pp. 3024–3051, 1993.
[5] Silero Team, “Silero VAD: pre-trained enterprise-grade Voice Activity Detector,” 2021.
[6] Microsoft, “ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator,” 2021.
[7] Google, “WebRTC Voice Activity Detector,” WebRTC Project, 2011.
[8] J. Ramírez, J. M. Górriz, and J. C. Segura, “Voice Activity Detection. Fundamentals and Speech Recognition System Robustness,” in Robust Speech Recognition and Understanding, 2007.
[9]
A statistical model-based voice activity detection,
J. Sohn, N. S. Kim, and W. Sung, “A statistical model-based voice activity detection,”IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1–3, 1999. A Feature Definitions Given a detection window ofW milliseconds andN speech segments{(si,e i)}N i=1 with durations di =e i−si: 14 speech_ratio= ∑N i=1di W (1) num_segments=N(2) mean_seg_ms= 1 N N∑ i=1 di (0ifN...
work page 1999
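The three feature definitions that survive in the appendix excerpt (speech_ratio, num_segments, mean_seg_ms) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the segment representation, and the example values are hypothetical, and returning 0 for an empty window is an assumed reading of the truncated "(0 if N..." clause. The remaining twelve features are not defined in this excerpt and are not attempted here.

```python
from typing import Dict, List, Tuple

# One VAD speech segment as (start_ms, end_ms); representation is an assumption.
Segment = Tuple[float, float]

def temporal_features(segments: List[Segment], window_ms: float) -> Dict[str, float]:
    """Compute the three features defined in the excerpt for one detection window."""
    durations = [end - start for start, end in segments]  # d_i = e_i - s_i
    n = len(durations)
    return {
        "speech_ratio": sum(durations) / window_ms,        # Eq. (1)
        "num_segments": n,                                 # Eq. (2)
        # Assumed fallback: 0 when the window contains no speech segments.
        "mean_seg_ms": sum(durations) / n if n else 0.0,
    }

# Hypothetical 4-second window containing three speech segments.
example = temporal_features([(200, 900), (1200, 2600), (3000, 3500)], window_ms=4000)
print(example)
```

A long, uninterrupted stretch of speech (high speech_ratio, few segments) is the kind of pattern such features can expose without any transcription.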
1. A cross-channel energy model (8 features from both bot and callee channels) was trained on 91 hand-labeled files, achieving 97.9% accuracy via leave-one-out cross-validation.
2. This model predicted labels for all 25,887 production recordings.
3. Predictions were stratified into confidence tiers:
   - STRONG_VM: p(VM) > 0.90 (11,830 files, 81.6% verified accuracy)
   - STRONG_NVM: p(VM) < 0.10 (12,514 files, 98.2% verified accuracy)
   - MODERATE: 0.10 ≤ p ≤ 0.90 (180 files, excluded)
   - UNCERTAIN: conflicting signals (1,067 files, excluded)
4. Only STRONG_VM and STRONG_NVM files were used for training (24,812 total).

This pseudo-labeling approach achieves a weighted accuracy of ∼90% on the training set, which is sufficient for training a model that generalizes to 99.3% on independently hand-labeled data.
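The tier thresholds and the ∼90% weighted accuracy quoted above can be reproduced with a short sketch. The function name is hypothetical, and the UNCERTAIN tier is left out because the excerpt does not say how "conflicting signals" are detected. The weighted accuracy uses the per-tier file counts as listed.

```python
def confidence_tier(p_vm: float) -> str:
    """Map a pseudo-label probability p(VM) to the tiers listed in the excerpt."""
    if p_vm > 0.90:
        return "STRONG_VM"
    if p_vm < 0.10:
        return "STRONG_NVM"
    return "MODERATE"  # 0.10 <= p <= 0.90, excluded from training

# (file count, verified accuracy) for the two tiers kept for training.
# Note: these listed counts sum to 24,344, not the stated 24,812 total.
kept_tiers = {
    "STRONG_VM": (11_830, 0.816),
    "STRONG_NVM": (12_514, 0.982),
}

total_files = sum(n for n, _ in kept_tiers.values())
weighted_accuracy = sum(n * acc for n, acc in kept_tiers.values()) / total_files
print(f"weighted pseudo-label accuracy: {weighted_accuracy:.1%}")
```

The result lands close to 90%, consistent with the figure quoted for the training set.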