pith. machine review for the scientific record.

arxiv: 2604.10503 · v1 · submitted 2026-04-12 · 💻 cs.SD · cs.AI

Recognition: 2 theorem links


Cross-Cultural Bias in Mel-Scale Representations: Evidence and Alternatives from Speech and Music

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:11 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords mel-scale bias · cross-cultural audio · speech recognition · music analysis · audio front-ends · fair audio processing · LEAF · CQT

The pith

Mel-scale audio features produce word error rates 12.5 percentage points higher on tonal languages and F1 scores 15.7 percent lower on non-Western music, while alternative front-ends shrink those gaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the mel scale, a frequency representation rooted in 1940s Western listening studies, creates uneven performance in modern audio systems. It runs controlled trials on speech from 11 languages, music from 6 collections, and acoustic scenes from 10 cities, swapping only the front-end while holding the model architecture fixed. Mel-scale features show clear drops for tonal languages and non-Western music. Learnable scales and other psychoacoustic options close much of the difference at little added cost. The work releases a benchmark so others can measure the same effect.
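The controlled-swap protocol is simple to picture in code. A minimal sketch, assuming librosa is installed, with mel and CQT standing in for the paper's larger front-end menu (LEAF, SincNet, ERB, and Bark would slot behind the same interface); everything downstream of the features is held fixed:

    import numpy as np
    import librosa

    def make_frontend(name, sr=16000, n_bins=64):
        """Return a waveform -> (n_bins, n_frames) feature extractor."""
        if name == "mel":
            return lambda y: librosa.power_to_db(
                librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bins))
        if name == "cqt":  # constant-Q: geometric spacing, uniform pitch resolution
            return lambda y: librosa.amplitude_to_db(
                np.abs(librosa.cqt(y=y, sr=sr, n_bins=n_bins)))
        raise ValueError(f"unknown front-end: {name}")

    y, sr = librosa.load(librosa.example("trumpet"), sr=16000)
    for name in ["mel", "cqt"]:
        feats = make_frontend(name, sr)(y)
        print(name, feats.shape)  # the identical model and training consume feats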

Core claim

Mel-scale representations yield 31.2 percent word error rate on tonal languages versus 18.7 percent on non-tonal ones, plus a 15.7 percent F1 drop between Western and non-Western music. LEAF cuts the speech gap by 34 percent, CQT reduces the music gap by 52 percent, and ERB-scale filtering narrows disparities by 31 percent at roughly 1 percent extra computation.
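How the headline numbers compose is worth making explicit. A quick arithmetic check, under the assumption (not spelled out in the abstract) that the quoted reductions apply to the absolute gaps:

    speech_gap = 31.2 - 18.7            # 12.5-point WER gap under mel-scale
    print(speech_gap * (1 - 0.34))      # LEAF: ~8.2 points remain
    print(15.7 * (1 - 0.52))            # CQT:  ~7.5-point F1 gap remains
    print(speech_gap * (1 - 0.31))      # ERB:  ~8.6 points remain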

What carries the argument

Controlled swaps of audio front-ends (mel-scale versus LEAF, SincNet, ERB, Bark, CQT) with all other model and training elements held constant, measured across speech, music, and scene tasks.
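For orientation, the fixed scales in that menu have standard closed forms; the constants below are the commonly cited ones (mel in its HTK-style form, ERB-number per Glasberg and Moore), which the paper may vary:

    \mathrm{mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right),
    \qquad
    \mathrm{ERBS}(f) = 21.4 \log_{10}\!\left(1 + 0.00437\, f\right)

Both compress high frequencies, but they distribute filter density differently through the low-frequency region where tonal pitch contours sit, which is the lever the swap experiments pull.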

If this is right

  • Adaptive frequency allocation in LEAF improves recognition for tonal languages without redesigning the rest of the system.
  • Constant-Q transform better captures pitch relations across Western and non-Western music collections.
  • ERB-scale filtering delivers comparable fairness gains at almost no extra cost.
  • Releasing FairAudioBench makes systematic cross-cultural testing of new front-ends straightforward.
  • Foundational signal-processing choices can propagate performance differences across languages and music traditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Audio pipelines aimed at global use may need to default to one of the lower-bias front-ends rather than mel-scale.
  • Similar front-end swaps could be tested in other domains such as environmental sound or bioacoustics.
  • The results point toward collecting more non-Western training data as a complementary fix alongside scale changes.
  • Developers could run quick ablation tests with ERB or CQT on their existing models to check for hidden gaps.

Load-bearing premise

Changing only the frequency scale while keeping architecture and training data fixed fully isolates bias from the scale itself rather than from data composition or language acoustics.

What would settle it

An experiment that retrains the identical models on culturally balanced data would settle it: if the mel-scale gaps shrink to match the alternatives, data composition was the main driver; if they persist under mel-scale while LEAF and CQT still close them, the front-end itself is implicated.
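The decision logic of that settling experiment can be written out directly. A sketch with placeholder numbers, where the balanced-data cells are exactly the missing measurements and the noise threshold is illustrative:

    # Hypothetical design matrix: front-end x training-data composition.
    # 'original' cells come from the reported numbers; 'balanced' cells are
    # the missing experiment. Values are WER gaps in percentage points.
    gap = {
        ("mel", "original"): 12.5,
        ("leaf", "original"): 8.25,    # implied by the 34% gap reduction
        ("mel", "balanced"): None,     # to be measured
        ("leaf", "balanced"): None,    # to be measured
    }

    def verdict(mel_gap, leaf_gap, noise=2.0):
        """Interpret the balanced-data gaps once they exist."""
        if mel_gap is None or leaf_gap is None:
            return "unsettled: run the balanced-data retraining"
        if mel_gap <= noise:
            return "data composition, not the scale, drove the gap"
        if mel_gap - leaf_gap > noise:
            return "front-end implicated: gap persists under mel, closes under LEAF"
        return "residual gap is front-end-independent (language acoustics?)"

    print(verdict(gap[("mel", "balanced")], gap[("leaf", "balanced")]))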

read the original abstract

Modern audio systems universally employ mel-scale representations derived from 1940s Western psychoacoustic studies, potentially encoding cultural biases that create systematic performance disparities. We present a comprehensive evaluation of cross-cultural bias in audio front-ends, comparing mel-scale features with learnable alternatives (LEAF, SincNet) and psychoacoustic variants (ERB, Bark, CQT) across speech recognition (11 languages), music analysis (6 collections), and European acoustic scene classification (10 European cities). Our controlled experiments isolate front-end contributions while holding architecture and training protocols minimal and constant. Results demonstrate that mel-scale features yield 31.2% WER for tonal languages compared to 18.7% for non-tonal languages (12.5% gap), and show 15.7% F1 degradation between Western and non-Western music. Alternative representations significantly reduce these disparities: LEAF reduces the speech gap by 34% through adaptive frequency allocation, CQT achieves 52% reduction in music performance gaps, and ERB-scale filtering cuts disparities by 31% with only 1% computational overhead. We also release FairAudioBench, enabling cross-cultural evaluation, and demonstrate that adaptive frequency decomposition offers practical paths toward equitable audio processing. These findings reveal how foundational signal processing choices propagate bias, providing crucial guidance for developing inclusive audio systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that mel-scale representations, rooted in 1940s Western psychoacoustics, encode cross-cultural biases that produce measurable performance gaps: 12.5% WER disparity between tonal and non-tonal languages in ASR, and 15.7% F1 drop between Western and non-Western music. Controlled experiments that hold model architecture and training protocols fixed while swapping only the front-end show that learnable (LEAF, SincNet) and alternative psychoacoustic (ERB, Bark, CQT) representations reduce these gaps by 31–52%, with ERB incurring only 1% extra compute. The authors release FairAudioBench to support further cross-cultural evaluation and conclude that adaptive frequency decompositions offer a practical route to more equitable audio systems.

Significance. If the isolation of front-end effects is convincingly demonstrated, the work would provide concrete empirical evidence that a foundational signal-processing choice propagates cultural bias across speech and music tasks, together with low-overhead remedies. The release of FairAudioBench is a clear positive contribution that enables reproducible follow-up studies. The multi-domain scope (11 languages, 6 music collections, 10 cities) strengthens the generality of the findings relative to single-task studies.

major comments (1)
  1. §3 (Experimental Protocol): The central claim that performance gaps are attributable to cultural bias in the mel scale rather than to mismatches between front-end and language-specific acoustic distributions rests on the assertion that 'architecture and training protocols [are] minimal and constant.' However, tonal-language ASR is trained on tonal corpora and non-tonal on non-tonal corpora; these differ in phoneme inventories, pitch distributions, and recording conditions. Without reported cross-lingual training, matched subset sizes, or an ablation that holds the training data fixed while varying only the front-end, the 34% gap reduction by LEAF and 52% reduction by CQT could arise from better alignment with the target acoustics rather than removal of Western bias. This directly affects the interpretation of the 12.5% WER and 15.7% F1 gaps as evidence of scale-induced bias.
minor comments (2)
  1. Abstract and §4: the reported WER and F1 figures are given as point estimates without standard deviations, confidence intervals, or the number of runs; adding these would allow readers to assess whether the gap reductions are statistically reliable (a minimal bootstrap sketch follows these comments).
  2. §5: the description of FairAudioBench would benefit from an explicit statement of which datasets are included, their language/cultural coverage, and any licensing or access restrictions.
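Minor comment 1 is cheap to act on. A minimal bootstrap sketch in Python, assuming per-utterance (errors, words) counts are available for each language group; the arrays below are synthetic stand-ins, not the paper's data:

    import numpy as np

    rng = np.random.default_rng(0)
    # Synthetic per-utterance counts: lengths of 5-24 words, error rates
    # matched to the reported group-level WERs (stand-ins only).
    words_t = rng.integers(5, 25, size=1000)
    errs_t = rng.binomial(words_t, 0.312)     # ~31.2% WER, tonal group
    words_n = rng.integers(5, 25, size=1000)
    errs_n = rng.binomial(words_n, 0.187)     # ~18.7% WER, non-tonal group

    def gap_ci(n_boot=10_000):
        """Percentile bootstrap CI for the WER gap, in percentage points."""
        gaps = np.empty(n_boot)
        for b in range(n_boot):
            i = rng.integers(len(words_t), size=len(words_t))
            j = rng.integers(len(words_n), size=len(words_n))
            gaps[b] = (100 * errs_t[i].sum() / words_t[i].sum()
                       - 100 * errs_n[j].sum() / words_n[j].sum())
        return np.percentile(gaps, [2.5, 97.5])

    lo, hi = gap_ci()
    print(f"tonal vs non-tonal WER gap, 95% CI: [{lo:.2f}, {hi:.2f}] points")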

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review. The positive assessment of the work's significance and the release of FairAudioBench is appreciated. We address the single major comment on the experimental protocol below, providing the strongest honest defense of our controls while acknowledging where clarification is needed. A partial revision will be made to improve transparency.

read point-by-point responses
  1. Referee: §3 (Experimental Protocol): The central claim that performance gaps are attributable to cultural bias in the mel scale rather than to mismatches between front-end and language-specific acoustic distributions rests on the assertion that 'architecture and training protocols [are] minimal and constant.' However, tonal-language ASR is trained on tonal corpora and non-tonal on non-tonal corpora; these differ in phoneme inventories, pitch distributions, and recording conditions. Without reported cross-lingual training, matched subset sizes, or an ablation that holds the training data fixed while varying only the front-end, the 34% gap reduction by LEAF and 52% reduction by CQT could arise from better alignment with the target acoustics rather than removal of Western bias. This directly affects the interpretation of the 12.5% WER and 15.7% F1 gaps as evidence of scale-induced bias.

    Authors: We thank the referee for this important observation on potential confounds. Our design fixes the neural architecture, optimizer, learning-rate schedule, epoch count, and data splits for each task while varying only the front-end; this isolates the frequency decomposition's effect within each corpus. The 12.5% WER gap appears specifically with mel-scale on tonal data and shrinks substantially with both learnable (LEAF) and alternative fixed scales (ERB, Bark, CQT), consistent with the music-domain results where non-Western collections exhibit analogous degradations. This pattern indicates that the Western-derived mel warping is suboptimal for the pitch and spectral characteristics of the target distributions, rather than a generic data mismatch. We agree that cross-lingual training or a fully matched-subset ablation would further strengthen causal attribution; such experiments lie outside the current benchmark's scope given available corpora. We will revise §3 to explicitly discuss this limitation, report the within-corpus controls more prominently, and note that the multi-domain consistency (speech + music) supports the scale-bias interpretation. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical front-end comparisons are self-contained

full rationale

The paper reports measured WER and F1 gaps from direct experiments that swap only the audio front-end while holding model architecture and training protocols fixed. No equations, derivations, or first-principles claims appear in the provided text; the central results (12.5% WER gap, 15.7% F1 gap, and their reductions under LEAF/CQT/ERB) are obtained by running the same downstream tasks on the same data splits with different input representations. No self-citations are invoked to justify uniqueness or to close a derivation loop, and no fitted parameters are renamed as predictions. The skeptic concern about data-distribution interactions is a methodological limitation, not a circularity in the reported chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that observed performance differences can be attributed primarily to the choice of frequency scale once architecture and training are held fixed, plus the implicit assumption that the chosen datasets adequately represent cultural variation.

axioms (1)
  • domain assumption Mel-scale representations derived from 1940s Western psychoacoustic studies may encode cultural biases that affect downstream task performance.
    Invoked in the opening motivation and used to interpret the observed gaps.
invented entities (1)
  • FairAudioBench (no independent evidence)
    purpose: Benchmark dataset and protocol for cross-cultural audio evaluation
    New artifact introduced to enable replication and further testing of the reported disparities.

pith-pipeline@v0.9.0 · 5540 in / 1430 out tokens · 69204 ms · 2026-05-10T16:11:38.555211+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1]

    INTRODUCTION Audio systems deployed for billions of users universally employ mel-scale representations derived from 1940s Western psychoacoustic studies [1]. This seemingly technical choice has profound consequences: recent studies document 2x higher word error rates for African American speakers across major ASR platforms [2], while non-Western music...

  2. [2]

    Cross-Cultural Bias in Mel-Scale Representations: Evidence and Alternatives from Speech and Music

    MEASURING CROSS-CULTURAL BIAS IN AUDIO FRONT-ENDS 2.1. Problem Formulation We hypothesize that mel-scale representations create systematic disadvantages for non-Western users, particularly the 2 billion speakers of tonal languages, where pitch variations distinguish word meanings. To quantify this bias, we compare ...

  3. [3]

    EXPERIMENTS AND RESULTS 3.1. Datasets We evaluate across three complementary domains using carefully balanced data: Speech Recognition: Common Voice v17.0 [17] with 11 languages. Tonal languages (5): Mandarin Chinese (4 tones), Vietnamese (6 tones), Thai (5 tones), Punjabi (3 tones), Cantonese (6 tones). Non-tonal languages (6): English, Spanish, German, Fr...

  4. [4]

    The mel scale, derived from 1940s Western studies, was never validated cross-culturally

    CONCLUSION Our findings challenge assumptions of universal psychoacoustic models. The mel scale, derived from 1940s Western studies, was never validated cross-culturally. As audio AI becomes a global infrastructure, embedding such assumptions constitutes technological bias at scale. Simple alternatives exist today: production systems could deploy ER...

  5. [5]

    The relation of pitch to frequency: A revised scale,

    Stanley S Stevens and John Volkmann, "The relation of pitch to frequency: A revised scale," The American Journal of Psychology, vol. 53, no. 3, pp. 329–353, 1940

  6. [6]

    Racial disparities in automated speech recognition,

    Allison Koenecke, Andrew Nam, Emily Lake, Joe Nudell, Minnie Quartey, Zion Mengesha, Connor Toups, John R Rickford, Dan Jurafsky, and Sharad Goel, "Racial disparities in automated speech recognition," Proceedings of the National Academy of Sciences, vol. 117, no. 14, pp. 7684–7689, 2020

  7. [7]

    Music for all: Representational bias and cross-cultural adaptability of music generation models,

    Atharva Mehta, Shivam Chauhan, Amirbek Djanibekov, Atharva Kulkarni, Gus Xia, and Monojit Choudhury, "Music for all: Representational bias and cross-cultural adaptability of music generation models," arXiv preprint arXiv:2502.07328, 2025

  8. [8]

    Moira Jean Winsland Yip, Tone, Cambridge University Press, 2002

  9. [9]

    Acoustical studies of Mandarin vowels and tones,

    John Marshall Howie, Acoustical studies of Mandarin vowels and tones, vol. 18, Cambridge University Press, 1976

  10. [10]

    The world atlas of language structures online,

    Matthew Dryer and Martin Haspelmath, "The world atlas of language structures online," Dec. 2022

  11. [11]

    Missing melodies: AI music generation and its 'nearly' complete omission of the global south,

    Atharva Mehta, Shivam Chauhan, and Monojit Choudhury, "Missing melodies: AI music generation and its 'nearly' complete omission of the global south," arXiv preprint arXiv:2412.04100, 2024

  12. [12]

    Leaf: A learnable frontend for audio classification,

    Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry, and Marco Tagliasacchi, "Leaf: A learnable frontend for audio classification," arXiv preprint arXiv:2101.08596, 2021

  13. [13]

    Suggested formulae for calculating auditory-filter bandwidths and excitation patterns,

    Brian C Moore and Brian R Glasberg, "Suggested formulae for calculating auditory-filter bandwidths and excitation patterns," The Journal of the Acoustical Society of America, vol. 74, no. 3, pp. 750–753, 1983

  14. [14]

    Subdivision of the audible frequency range into critical bands (Frequenzgruppen),

    Eberhard Zwicker, "Subdivision of the audible frequency range into critical bands (Frequenzgruppen)," The Journal of the Acoustical Society of America, vol. 33, no. 2, pp. 248–248, 1961

  15. [15]

    Calculation of a constant Q spectral transform,

    Judith C Brown, "Calculation of a constant Q spectral transform," The Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, 1991

  16. [16]

    Towards measuring fairness in speech recognition: Fair-speech dataset,

    Irina-Elena Veliche, Zhuangqun Huang, Vineeth Ayyat Kochaniyan, Fuchun Peng, Ozlem Kalinli, and Michael L Seltzer, "Towards measuring fairness in speech recognition: Fair-speech dataset," arXiv preprint arXiv:2408.12734, 2024

  17. [17]

    The computational study of a musical culture through its digital traces,

    Xavier Serra, "The computational study of a musical culture through its digital traces," Acta Musicologica, vol. 89, no. 1, pp. 24–44, 2017

  18. [18]

    Solon Barocas, Moritz Hardt, and Arvind Narayanan, Fairness and machine learning: Limitations and opportunities, MIT Press, 2023

  19. [19]

    Equality of opportunity in supervised learning,

    Moritz Hardt, Eric Price, and Nati Srebro, "Equality of opportunity in supervised learning," Advances in Neural Information Processing Systems, vol. 29, 2016

  20. [20]

    Thomas M Cover and Joy A Thomas, Elements of information theory, John Wiley & Sons, 1999

  21. [21]

    Common voice: A massively-multilingual speech corpus,

    R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, "Common voice: A massively-multilingual speech corpus," in Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020, pp. 4211–4215

  22. [22]

    Musical genre classification of audio signals,

    George Tzanetakis and Perry Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002

  23. [23]

    FMA: A Dataset For Music Analysis

    Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson, "FMA: A dataset for music analysis," arXiv preprint arXiv:1612.01840, 2016

  24. [24]

    TAU Urban Acoustic Scenes 2020 Mobile, development dataset,

    Toni Heittola, Annamaria Mesaros, and Tuomas Virtanen, "TAU Urban Acoustic Scenes 2020 Mobile, development dataset," Feb. 2020

  25. [25]

    Derivation of auditory filter shapes from notched-noise data,

    Brian R Glasberg and Brian CJ Moore, "Derivation of auditory filter shapes from notched-noise data," Hearing Research, vol. 47, no. 1-2, pp. 103–138, 1990

  26. [26]

    Speaker recognition from raw waveform with SincNet,

    Mirco Ravanelli and Yoshua Bengio, "Speaker recognition from raw waveform with SincNet," in 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, pp. 1021–1028

  27. [27]

    Trainable frontend for robust and far-field keyword spotting,

    Yuxuan Wang, Pascal Getreuer, Thad Hughes, Richard F Lyon, and Rif A Saurous, "Trainable frontend for robust and far-field keyword spotting," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 5670–5674