Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German
Pith reviewed 2026-05-25 05:54 UTC · model grok-4.3
The pith
ElevenLabs Scribe v2 records the lowest overall WER (13.2%) on code-switched Arabic, Persian, and German speech, and a comparison against BERTScore suggests that WER exaggerates quality gaps between systems by roughly 3×.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Commercial ASR systems differ substantially in how well they handle natural code-switching. ElevenLabs Scribe v2 achieves the lowest overall word error rate (13.2%) and the highest BERTScore (0.936) across the four language pairs. WER and BERTScore produce identical ordinal rankings for the Arabic and Persian pairs, yet WER inflates the magnitude of the gaps by a factor of approximately three because it penalizes semantically correct transliteration choices.
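To make the transliteration penalty concrete, here is a minimal Python sketch of word-level WER computed by edit distance. The `wer` helper and the example utterance are illustrative assumptions, not taken from the paper's code or dataset, and a real evaluation would normally normalize text (punctuation, casing, diacritics) first.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Before row i is processed, prev[j] is the distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical Egyptian Arabic-English utterance ("I want a meeting tomorrow").
reference = "أنا عايز meeting بكرة"
# Same meaning, but the English word is transliterated into Arabic script.
hypothesis = "أنا عايز ميتنج بكرة"

print(wer(reference, hypothesis))  # 0.25: one substitution over four reference words
```

A listener would judge the two transcripts equivalent, yet the script mismatch on a single word costs a quarter of the utterance under WER.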
What carries the argument
Dual-metric evaluation of WER against BERTScore on code-switched utterances, which isolates the effect of penalizing valid transliterations versus measuring semantic fidelity.
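The semantic side of the comparison can be sketched in the same spirit. BERTScore matches each token greedily to its most similar counterpart in embedding space and combines precision and recall into an F1. The toy below uses hand-made 3-D vectors as stand-ins for contextual embeddings; the vectors, tokens, and the `greedy_f1` helper are assumptions for illustration only, and the real metric relies on a pretrained model and optional IDF weighting.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def greedy_f1(ref_vecs, hyp_vecs):
    """BERTScore-style F1: every token is matched to its most similar counterpart."""
    recall = sum(max(cosine(r, h) for h in hyp_vecs) for r in ref_vecs) / len(ref_vecs)
    precision = sum(max(cosine(h, r) for r in ref_vecs) for h in hyp_vecs) / len(hyp_vecs)
    return 2 * precision * recall / (precision + recall)


# Toy embeddings (assumed): the transliterated word sits close to its English source.
toy_embeddings = {
    "عايز": (1.0, 0.0, 0.0),
    "بكرة": (0.0, 1.0, 0.0),
    "meeting": (0.0, 0.0, 1.0),
    "ميتنج": (0.0, 0.6, 0.8),
}

ref_vecs = [toy_embeddings[t] for t in ["عايز", "meeting", "بكرة"]]
hyp_vecs = [toy_embeddings[t] for t in ["عايز", "ميتنج", "بكرة"]]

score = greedy_f1(ref_vecs, hyp_vecs)
print(round(score, 3))  # roughly 0.933
```

Under these assumed vectors the hypothesis scores close to 1 even though one of its three words fails an exact string match, which is the kind of divergence the dual-metric setup is meant to expose.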
If this is right
- System rankings remain stable when switching from WER to BERTScore for the Arabic and Persian pairs.
- Aggregate averages mask larger performance gaps on harder code-switched subsets.
- BERT embeddings place reference and hypothesis texts close together even when scripts differ.
- Public release of the 1200-sample dataset allows direct replication and extension by other researchers.
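The rank-stability point can be checked with Kendall's τ. The sketch below implements the tau-a variant, which assumes no ties; the source does not say which variant the authors used, and the system names and scores are hypothetical placeholders rather than the paper's numbers.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs; assumes no ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(list(zip(xs, ys)), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Hypothetical per-system scores for one language pair (not the paper's numbers).
wer_scores = {"sys_a": 0.12, "sys_b": 0.18, "sys_c": 0.25, "sys_d": 0.31, "sys_e": 0.40}
bert_scores = {"sys_a": 0.94, "sys_b": 0.92, "sys_c": 0.90, "sys_d": 0.88, "sys_e": 0.85}

systems = sorted(wer_scores)
# Negate WER so that "higher is better" holds for both metrics.
quality_by_wer = [-wer_scores[s] for s in systems]
quality_by_bert = [bert_scores[s] for s in systems]

tau = kendall_tau(quality_by_wer, quality_by_bert)
print(tau)  # 1.0: both metrics order the five systems identically
```

Note that τ = 1.0 says only that the orderings agree; it carries no information about how large the gaps are on either metric.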
Where Pith is reading between the lines
- Semantic metrics may become preferred over WER for ASR evaluation whenever transliteration or script variation is common.
- The observed 3x inflation suggests that current commercial systems could improve more than aggregate WER numbers indicate if training data included more code-switched examples.
- The difficulty-stratified results point to a need for benchmarks that report performance by utterance complexity rather than single overall scores.
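A small sketch shows how an aggregate average can hide a gap that appears only on harder utterances. The difficulty buckets, system names, and WER values are invented for illustration, since the source does not say how the paper defines difficulty. The sketch also averages per-utterance WER, which differs from corpus-level WER (total errors over total reference words).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-utterance results: (system, difficulty bucket, WER).
results = [
    ("sys_a", "easy", 0.05), ("sys_a", "easy", 0.07), ("sys_a", "hard", 0.40),
    ("sys_b", "easy", 0.06), ("sys_b", "easy", 0.08), ("sys_b", "hard", 0.80),
]


def stratified_wer(rows):
    """Mean WER per (system, difficulty), plus an 'overall' mean per system."""
    buckets = defaultdict(list)
    for system, difficulty, value in rows:
        buckets[(system, difficulty)].append(value)
        buckets[(system, "overall")].append(value)
    return {key: mean(values) for key, values in buckets.items()}


table = stratified_wer(results)
for key in sorted(table):
    print(key, round(table[key], 3))

overall_gap = table[("sys_b", "overall")] - table[("sys_a", "overall")]
hard_gap = table[("sys_b", "hard")] - table[("sys_a", "hard")]
```

With these made-up numbers the two systems are about 0.14 apart overall but 0.40 apart on the hard bucket, so a single headline score understates the difference.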
Load-bearing premise
The two-stage pipeline of heuristic filtering followed by an ensemble of GPT-4o and Gemini 1.5 Pro produces a representative sample of natural code-switching utterances.
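The source gives no detail on the heuristics or the ensemble's decision rule, so the following is only a hypothetical sketch of what a two-stage selector of this shape could look like. The mixed-script rule (which would not work for German–English, where both languages use Latin script), the mean-threshold rule, and the stub scorers standing in for GPT-4o and Gemini 1.5 Pro calls are all assumptions. The sketch also illustrates one plausible source of cost savings: only utterances that pass the cheap filter reach the LLM stage.

```python
import unicodedata
from statistics import mean


def scripts_in(text):
    """Coarse script labels (e.g. 'LATIN', 'ARABIC') of the letters in text."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in text if ch.isalpha()}


def passes_heuristic(text):
    """Stage 1 (assumed rule): keep utterances that mix at least two scripts."""
    return len(scripts_in(text)) >= 2


def ensemble_score(text, scorers):
    """Stage 2 (assumed rule): average the scores returned by each scorer."""
    return mean(scorer(text) for scorer in scorers)


def select(candidates, scorers, threshold=0.5):
    """Return the selected utterances and the share of LLM calls avoided."""
    shortlisted = [c for c in candidates if passes_heuristic(c)]
    selected = [c for c in shortlisted if ensemble_score(c, scorers) >= threshold]
    saved = 1 - len(shortlisted) / len(candidates)
    return selected, saved


# Stub scorers standing in for real LLM calls; they return fixed scores.
def stub_scorer_one(text):
    return 0.9


def stub_scorer_two(text):
    return 0.7


candidates = [
    "hello world",
    "أنا عايز meeting بكرة",
    "مرحبا بالعالم",
    "من یک deadline دارم",
]
selected, saved = select(candidates, [stub_scorer_one, stub_scorer_two])
print(len(selected), saved)  # 2 0.5
```

Whatever the actual rules are, the representativeness concern applies to both stages: a filter that is too narrow, or scorers with shared blind spots, would bias which utterances end up in the benchmark.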
What would settle it
A new collection of 1200 utterances drawn from the same language pairs but selected entirely by human annotators without the LLM ensemble, on which ElevenLabs no longer shows the lowest WER or the same ranking order.
Original abstract
Code-switching, the natural alternation between two languages within a single utterance, remains one of the most challenging and under-studied conditions for automatic speech recognition (ASR). We present a benchmark evaluating five commercial ASR providers across four language pairs: Egyptian Arabic–English, Saudi Arabic (Najdi/Hijazi)–English, Persian (Farsi)–English, and German–English, comprising 300 samples per pair selected by a two-stage pipeline combining heuristic filtering with a GPT-4o and Gemini 1.5 Pro ensemble scorer, reducing LLM costs by ≈91%. We evaluate on both WER and BERTScore, showing that while both metrics agree on the ordinal ranking of systems for all Arabic and Persian pairs (τ = 1.0), WER inflates the magnitude of quality gaps by approximately 3× by penalising semantically correct transliteration choices. ElevenLabs Scribe v2 achieves the lowest WER (13.2% overall) and leads on BERTScore (0.936 overall). Difficulty-stratified analysis reveals performance gaps masked by aggregate averages, and BERT embedding projections confirm semantic proximity between reference and hypothesis despite surface-level script differences. The dataset is publicly available at https://huggingface.co/datasets/Perle-ai/ASR_Code_Switch.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper benchmarks five commercial ASR systems on code-switching speech for four language pairs (Egyptian Arabic–English, Saudi Arabic–English, Persian–English, German–English) using a 1200-utterance dataset (300 per pair) constructed via a two-stage heuristic-filter plus GPT-4o/Gemini 1.5 Pro ensemble pipeline that reduces LLM costs by ~91%. It reports ElevenLabs Scribe v2 as best overall (13.2% WER, 0.936 BERTScore), shows perfect rank agreement (τ=1.0) between WER and BERTScore on Arabic/Persian pairs, claims WER inflates quality gaps by ~3× by penalizing valid transliterations, provides difficulty-stratified analysis and embedding projections, and releases the dataset publicly.
Significance. If the dataset is representative, the work supplies a timely public benchmark on an under-studied ASR condition, demonstrates the practical divergence between surface (WER) and semantic (BERTScore) metrics, and supplies difficulty-stratified and embedding-based diagnostics. The public Hugging Face release is a clear reproducibility strength.
Major comments (1)
- Dataset construction pipeline (abstract and methods): the two-stage pipeline is characterized only by its ~91% cost reduction; the paper reports no human validation, no inter-rater agreement, and no comparison to existing code-switching (CS) corpora such as SEAME or Bangor. This is load-bearing for the central claims, because the ordinal rankings, the 13.2%/0.936 headline numbers, and the 3× WER-inflation assertion all presuppose that the 1200 utterances constitute a faithful sample of natural code-switching.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive review. We address the single major comment below and will revise the manuscript accordingly to improve transparency.
Point-by-point responses
- Referee: Dataset construction pipeline (abstract and methods): the two-stage pipeline is described only via its ~91% cost reduction; no human validation, inter-rater agreement, or comparison to existing CS corpora (SEAME, Bangor) is reported. This is load-bearing for the central claims, because the ordinal rankings, the 13.2%/0.936 headline numbers, and the 3× WER-inflation assertion all presuppose that the 1200 utterances constitute a faithful sample of natural code-switching.
  Authors: We agree that the Methods section provides insufficient detail on the pipeline beyond the cost-reduction figure. In the revised manuscript we will expand this section to describe the specific heuristic filters applied in stage one, the exact ensemble scoring procedure and decision rules used with GPT-4o and Gemini 1.5 Pro in stage two, and the final selection criteria. We will also add a dedicated subsection comparing our dataset construction and language-pair coverage to existing corpora such as SEAME and Bangor. Human validation and inter-rater agreement statistics were not collected in the original study; we will state this limitation explicitly and discuss its implications for the representativeness claims. These additions will allow readers to evaluate the dataset more rigorously while preserving the reported results.
  Revision: yes
Circularity Check
No circularity: purely empirical benchmarking with no derivations or self-referential predictions
Full rationale
This paper conducts an empirical comparison of five commercial ASR systems on a new code-switching dataset for four language pairs. It describes a two-stage data collection pipeline (heuristic filtering + LLM ensemble) and reports direct WER and BERTScore measurements, including ordinal rankings and a 3× inflation observation between metrics. No mathematical derivations, fitted parameters renamed as predictions, self-citation chains, or uniqueness theorems appear in the abstract or described content. The central claims rest on observable performance numbers on the collected utterances rather than any reduction to the paper's own inputs or prior self-authored results. The work is self-contained against external benchmarks.