
arxiv: 2605.19069 · v3 · pith:BZY4GCE4new · submitted 2026-05-18 · 💻 cs.CL · cs.AI

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

Pith reviewed 2026-05-25 05:54 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords code-switching · automatic speech recognition · ASR · word error rate · BERTScore · Arabic-English · Persian-English · German-English

The pith

ElevenLabs Scribe v2 records the lowest overall WER (13.2%) on code-switched Arabic, Persian, and German speech, while BERTScore indicates that WER exaggerates quality gaps roughly threefold.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates five commercial ASR systems on code-switched speech across four language pairs using 300 curated samples each. It reports that ElevenLabs Scribe v2 leads on both WER and BERTScore, with the two metrics agreeing on system order but differing sharply on the size of performance differences. The work releases the full dataset and shows that surface-level script mismatches in transliterations drive much of the WER penalty even when meaning is preserved. Difficulty-stratified breakdowns reveal performance variation hidden by overall averages. Semantic embedding projections confirm that references and hypotheses remain close despite surface differences.

Core claim

Commercial ASR systems differ substantially in handling natural code-switching, with ElevenLabs Scribe v2 achieving the lowest overall word error rate of 13.2% and the highest BERTScore of 0.936 across the four pairs; WER and BERTScore produce identical ordinal rankings for Arabic and Persian pairs yet WER inflates the magnitude of gaps by a factor of approximately three because it penalizes semantically correct transliteration choices.

What carries the argument

Dual-metric evaluation of WER against BERTScore on code-switched utterances, which isolates the effect of penalizing valid transliterations versus measuring semantic fidelity.
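
To make the transliteration penalty concrete, here is a minimal sketch of word-level WER. It assumes plain whitespace tokenization and no text normalization, which may differ from the paper's setup, and the example utterance is hypothetical. A hypothesis that writes an English loanword in Arabic script where the reference uses Latin script is charged a full substitution, even though the spoken word is the same.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example (not from the dataset): the reference keeps "meeting"
# in Latin script, the hypothesis transliterates it into Arabic script.
reference = "أنا عندي meeting بكرة"
hypothesis = "أنا عندي ميتنج بكرة"
print(wer(reference, hypothesis))  # 0.25
```

One of four words differs on the surface, so WER is 25% for a transcript whose meaning is unchanged. An embedding-based metric such as BERTScore compares contextual token representations instead of exact strings, which is why it can score such a pair far more leniently.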

If this is right

  • System rankings remain stable when switching from WER to BERTScore for the Arabic and Persian pairs.
  • Aggregate averages mask larger performance gaps on harder code-switched subsets.
  • BERT embeddings place reference and hypothesis texts close together even when scripts differ.
  • Public release of the 1200-sample dataset allows direct replication and extension by other researchers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Semantic metrics may become preferred over WER for ASR evaluation whenever transliteration or script variation is common.
  • The observed 3x inflation suggests that current commercial systems could improve more than aggregate WER numbers indicate if training data included more code-switched examples.
  • The difficulty-stratified results point to a need for benchmarks that report performance by utterance complexity rather than single overall scores.
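
As an illustration of why stratified reporting matters, the following sketch groups per-utterance WER values by a difficulty label and compares per-stratum means with the overall mean. The labels ("easy", "hard"), the system names, and all numbers are hypothetical; the paper's actual strata and scores are not reproduced here.

```python
from collections import defaultdict
from statistics import mean

# (system, difficulty, per-utterance WER in percent): hypothetical values
rows = [
    ("A", "easy", 10), ("A", "easy", 10), ("A", "easy", 10), ("A", "hard", 30),
    ("B", "easy", 5), ("B", "easy", 5), ("B", "easy", 5), ("B", "hard", 45),
]


def stratified_means(rows):
    """Return (overall mean per system, mean per (system, difficulty))."""
    by_system = defaultdict(list)
    by_stratum = defaultdict(list)
    for system, difficulty, wer_pct in rows:
        by_system[system].append(wer_pct)
        by_stratum[(system, difficulty)].append(wer_pct)
    overall = {s: mean(v) for s, v in by_system.items()}
    strata = {k: mean(v) for k, v in by_stratum.items()}
    return overall, strata


overall, strata = stratified_means(rows)
print(overall)                                     # both systems average 15
print(strata[("A", "hard")], strata[("B", "hard")])  # 30 vs 45
```

In this toy data the two systems tie on the aggregate but differ by 15 points on the hard subset, which is the kind of gap a single overall score would hide.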

Load-bearing premise

The two-stage pipeline of heuristic filtering followed by an ensemble of GPT-4o and Gemini 1.5 Pro produces a representative sample of natural code-switching utterances.
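
The general shape of such a two-stage selection can be sketched as follows, under several assumptions. The stage-one heuristic here is a simple check for mixed ASCII and non-ASCII letters (which would not work for German–English, where both languages use Latin script), the ensemble is the mean of two scorer callables standing in for the LLM judges, and the `top_k` value is arbitrary. None of these details are taken from the paper; the point is only that a cheap filter limits how many transcripts reach the expensive scorers.

```python
def has_script_mix(text: str) -> bool:
    """Cheap stage-one check: text has both ASCII and non-ASCII letters."""
    letters = [c for c in text if c.isalpha()]
    ascii_letters = sum(1 for c in letters if c.isascii())
    return 0 < ascii_letters < len(letters)


def select_samples(transcripts, scorers, top_k):
    """Filter cheaply, then rank survivors by the mean of the scorer outputs."""
    # Stage 1: heuristic filter, no expensive calls.
    candidates = [t for t in transcripts if has_script_mix(t)]
    # Stage 2: ensemble score (simple mean) on the survivors only.
    scored = [(sum(s(t) for s in scorers) / len(scorers), t) for t in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    selected = [t for _, t in scored[:top_k]]
    # Fraction of expensive scorer calls avoided by stage 1.
    saved = 1 - len(candidates) / len(transcripts)
    return selected, saved


# Hypothetical transcripts and stand-in scorers (real scorers would call LLMs).
transcripts = [
    "hello world",
    "أنا عندي meeting بكرة",
    "مرحبا",
    "این feature خوبه",
]
scorer_length = lambda t: float(len(t.split()))  # placeholder for one judge
scorer_const = lambda t: 1.0                     # placeholder for another judge

selected, saved = select_samples(transcripts, [scorer_length, scorer_const], top_k=1)
print(selected, saved)
```

Here two of the four transcripts survive stage one, so half of the expensive scoring calls are avoided. Whether a filter of this kind keeps a representative sample of natural code-switching is exactly the premise in question.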

What would settle it

A new collection of 1200 utterances drawn from the same language pairs but selected entirely by human annotators without the LLM ensemble, on which ElevenLabs no longer shows the lowest WER or the same ranking order.

Figures

Figures reproduced from arXiv: 2605.19069 by Ahmad ElShiekh, Ahmed Rashad, Clayton W. Taylor, Ghassan Al-Sumaidaee, Sajjad Abdoli.

Figure 1
Figure 1: Topic distribution of the 300 benchmark samples per language pair, classified by GPT-4o using an inductively derived taxonomy. Topics reflect the actual semantic domains of the code-switching speech corpora. Scoring signals. Each transcript receives a composite H_Score in [0, 10] from five weighted signals. Script mix ratio (w = 0.30). Let na be the count of Arabic/Persian Unicode characters and nl the cou…
Figure 2
Figure 2: Mean WER (%) for all five systems across all four language pairs. Each group of bars represents one language pair; bars within a group are colour-coded by system. Deepgram results shown for German only (no Arabic/Persian CS support). Lower is better. v2 is the most consistent system: it achieves the lowest WER for all four language pairs and remains competitive or leading on BERTScore. The Arabic pairs are…
Figure 3
Figure 3: Mean BERTScore F1 for all five systems across all four language pairs (bert-base-multilingual-cased). BERTScore rewards semantically correct transliteration that WER would penalise. Higher is better. Appendix D provides the full qualitative side-by-side transcription comparison for the highest-divergence examples across all language pairs, including…
Figure 4
Figure 4: UMAP projection of multilingual sentence embeddings for 80 Persian–English utterance pairs. Blue = reference transcripts; orange = ElevenLabs Scribe v2 hypotheses. Each grey line connects a reference–hypothesis pair. The tight clustering of connected pairs (short line lengths) confirms semantic proximity despite surface-level script differences (e.g., "feature" in Latin script vs. its Persian-script rendering), consistent with the high Persian–E…
Original abstract

Code-switching, the natural alternation between two languages within a single utterance, remains one of the most challenging and under-studied conditions for automatic speech recognition (ASR). We present a benchmark evaluating five commercial ASR providers across four language pairs: Egyptian Arabic–English, Saudi Arabic (Najdi/Hijazi)–English, Persian (Farsi)–English, and German–English, comprising 300 samples per pair selected by a two-stage pipeline combining heuristic filtering with a GPT-4o and Gemini 1.5 Pro ensemble scorer, reducing LLM costs by ≈91%. We evaluate on both WER and BERTScore, showing that while both metrics agree on the ordinal ranking of systems for all Arabic and Persian pairs (τ = 1.0), WER inflates the magnitude of quality gaps by approximately 3× by penalising semantically correct transliteration choices. ElevenLabs Scribe v2 achieves the lowest WER (13.2% overall) and leads on BERTScore (0.936 overall). Difficulty-stratified analysis reveals performance gaps masked by aggregate averages, and BERT embedding projections confirm semantic proximity between reference and hypothesis despite surface-level script differences. The dataset is publicly available at https://huggingface.co/datasets/Perle-ai/ASR_Code_Switch.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper benchmarks five commercial ASR systems on code-switching speech for four language pairs (Egyptian Arabic–English, Saudi Arabic–English, Persian–English, German–English) using a 1200-utterance dataset (300 per pair) constructed via a two-stage heuristic-filter plus GPT-4o/Gemini 1.5 Pro ensemble pipeline that reduces LLM costs by ~91%. It reports ElevenLabs Scribe v2 as best overall (13.2% WER, 0.936 BERTScore), shows perfect rank agreement (τ=1.0) between WER and BERTScore on Arabic/Persian pairs, claims WER inflates quality gaps by ~3× by penalizing valid transliterations, provides difficulty-stratified analysis and embedding projections, and releases the dataset publicly.

Significance. If the dataset is representative, the work supplies a timely public benchmark on an under-studied ASR condition, demonstrates the practical divergence between surface (WER) and semantic (BERTScore) metrics, and supplies difficulty-stratified and embedding-based diagnostics. The public Hugging Face release is a clear reproducibility strength.

major comments (1)
  1. Dataset construction pipeline (abstract and methods): the two-stage pipeline is described only via its ~91% cost reduction; no human validation, inter-rater agreement, or comparison to existing CS corpora (SEAME, Bangor) is reported. This is load-bearing for the central claims, because the ordinal rankings, the 13.2%/0.936 headline numbers, and the 3× WER-inflation assertion all presuppose that the 1200 utterances constitute a faithful sample of natural code-switching.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive review. We address the single major comment below and will revise the manuscript accordingly to improve transparency.

Point-by-point responses
  1. Referee: Dataset construction pipeline (abstract and methods): the two-stage pipeline is described only via its ~91% cost reduction; no human validation, inter-rater agreement, or comparison to existing CS corpora (SEAME, Bangor) is reported. This is load-bearing for the central claims, because the ordinal rankings, the 13.2%/0.936 headline numbers, and the 3× WER-inflation assertion all presuppose that the 1200 utterances constitute a faithful sample of natural code-switching.

    Authors: We agree that the Methods section provides insufficient detail on the pipeline beyond the cost-reduction figure. In the revised manuscript we will expand this section to describe the specific heuristic filters applied in stage one, the exact ensemble scoring procedure and decision rules used with GPT-4o and Gemini 1.5 Pro in stage two, and the final selection criteria. We will also add a dedicated subsection comparing our dataset construction and language-pair coverage to existing corpora such as SEAME and Bangor. Human validation and inter-rater agreement statistics were not collected in the original study; we will state this limitation explicitly and discuss its implications for the representativeness claims. These additions will allow readers to evaluate the dataset more rigorously while preserving the reported results. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical benchmarking with no derivations or self-referential predictions

Full rationale

This paper conducts an empirical comparison of five commercial ASR systems on a new code-switching dataset for four language pairs. It describes a two-stage data collection pipeline (heuristic filtering + LLM ensemble) and reports direct WER and BERTScore measurements, including ordinal rankings and a 3× inflation observation between metrics. No mathematical derivations, fitted parameters renamed as predictions, self-citation chains, or uniqueness theorems appear in the abstract or described content. The central claims rest on observable performance numbers on the collected utterances rather than any reduction to the paper's own inputs or prior self-authored results. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmarking study; no mathematical model, free parameters, axioms, or new postulated entities.


