Standard audiogram classification from loudness scaling data using unsupervised, supervised, and explainable machine learning techniques

Birger Kollmeier; Chen Xu; Lena Schell-Majoor

arxiv: 2512.04616 · v1 · submitted 2025-12-04 · 💻 cs.SD · physics.med-ph

Chen Xu, Lena Schell-Majoor, Birger Kollmeier

Pith reviewed 2026-05-17 01:54 UTC · model grok-4.3

keywords: loudness scaling, audiogram classification, machine learning, Bisgaard types, ACALOS, remote audiometry, supervised classification

The pith

Machine learning models can predict standard Bisgaard audiogram types from calibration-independent loudness scaling data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether adaptive categorical loudness scaling (ACALOS) data, which does not depend on device calibration, can sort listeners into the six standard Bisgaard audiogram types. The authors ran principal component analysis (PCA) on a reference database (N = 847) and trained seven supervised multi-class classifiers, alongside unsupervised and explainable methods. The first two principal components captured more than half of the variance yet showed substantial overlap across classes. Logistic regression achieved the highest accuracy among the supervised models. The approach matters because it could support hearing assessment in remote or low-resource settings that lack traditional audiometers.

Core claim

Supervised machine learning classifiers trained on ACALOS loudness scaling data can assign listeners to Bisgaard audiogram types with reasonable accuracy, although principal component analysis reveals substantial overlap between the classes in the data space.

What carries the argument

Supervised multi-class classifiers, with logistic regression performing best, applied to ACALOS loudness scaling features to assign the six Bisgaard audiogram classes.
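To make the classifier family concrete, the following is a minimal sketch of multinomial (softmax) logistic regression for a six-class problem, written with NumPy only. It is not the authors' implementation: the data are synthetic Gaussian clusters, and the feature count, learning rate, iteration count, L2 penalty, and the function names `fit_softmax_regression` and `predict` are all assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilise the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_softmax_regression(X, y, n_classes, lr=0.1, n_iter=500, l2=1e-3):
    """Batch gradient descent on the L2-regularised multinomial cross-entropy."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(n_iter):
        P = softmax(X @ W + b)
        W -= lr * (X.T @ (P - Y) / n + l2 * W)
        b -= lr * (P - Y).mean(axis=0)
    return W, b

def predict(X, W, b):
    return np.argmax(X @ W + b, axis=1)

# Synthetic stand-in data (NOT the paper's database): 6 classes, 6 features.
rng = np.random.default_rng(0)
n_classes, n_per_class, n_features = 6, 100, 6
centers = rng.normal(scale=2.0, size=(n_classes, n_features))
X = np.vstack([c + rng.normal(size=(n_per_class, n_features)) for c in centers])
y = np.repeat(np.arange(n_classes), n_per_class)

# Hold out every fifth sample; standardise with training statistics only.
test_mask = np.arange(len(y)) % 5 == 0
mu, sd = X[~test_mask].mean(axis=0), X[~test_mask].std(axis=0)
X = (X - mu) / sd

W, b = fit_softmax_regression(X[~test_mask], y[~test_mask], n_classes)
accuracy = np.mean(predict(X[test_mask], W, b) == y[test_mask])
print(f"held-out accuracy: {accuracy:.2f} (uniform chance = {1 / n_classes:.2f})")
```

The synthetic clusters here are far easier to separate than the overlapping classes the paper describes, so the printed accuracy says nothing about the reported results; the sketch only shows the mechanics of a linear multi-class model.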

If this is right

Calibration-free loudness data can approximate standard hearing loss classifications in remote settings.
Logistic regression offers a practical route for such predictions among tested classifiers.
Explainable methods can highlight which loudness features most influence the class assignments.
Unsupervised PCA helps visualize the data structure but does not cleanly separate the six classes.
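The point about explainable methods can be illustrated with permutation importance, a common model-agnostic technique. The source does not say which explainable method the paper uses, so this choice is an assumption, as are the nearest-centroid classifier, the synthetic three-class data, and the generic feature indices. The sketch shuffles one feature at a time and records the resulting drop in held-out accuracy.

```python
import numpy as np

def fit_centroids(X, y):
    classes = np.unique(y)
    return classes, np.stack([X[y == c].mean(axis=0) for c in classes])

def predict(X, classes, centroids):
    # Assign each row to the class with the nearest centroid (squared Euclidean distance).
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[np.argmin(d, axis=1)]

def permutation_importance(X, y, classes, centroids, rng, n_repeats=20):
    """Mean drop in accuracy when each feature column is shuffled."""
    base = np.mean(predict(X, classes, centroids) == y)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])
            drops[j] += base - np.mean(predict(Xp, classes, centroids) == y)
    return base, drops / n_repeats

# Synthetic data (hypothetical): only feature 0 carries class information.
rng = np.random.default_rng(1)
y = np.repeat(np.arange(3), 150)
X = rng.normal(size=(len(y), 4))
X[:, 0] += 3.0 * y  # features 1-3 stay pure noise

train = np.arange(len(y)) % 2 == 0
classes, centroids = fit_centroids(X[train], y[train])
base_acc, importance = permutation_importance(
    X[~train], y[~train], classes, centroids, rng
)
for j, drop in enumerate(importance):
    print(f"feature_{j}: mean accuracy drop {drop:+.3f}")
```

A large drop marks a feature the model relies on; drops near zero mark features it largely ignores. Applied to loudness scaling features, the same procedure would rank which parameters drive the class assignments.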

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Smartphone apps running simple loudness scaling tests could provide preliminary hearing profiles without clinic equipment.
Adding other quick non-calibrated measures might improve separation where overlap currently limits performance.
Testing the same models on new listener groups outside the original database would check how well the mapping generalizes.

Load-bearing premise

The loudness scaling patterns contain enough distinguishing information to map reliably onto the six Bisgaard classes despite their overlap in the principal component map.

What would settle it

A new independent dataset of listeners with known Bisgaard types where classification accuracy falls well below usable levels, such as near or below 40 percent.

Original abstract

To address the calibration and procedural challenges inherent in remote audiogram assessment for rehabilitative audiology, this study investigated whether calibration-independent adaptive categorical loudness scaling (ACALOS) data can be used to approximate individual audiograms by classifying listeners into standard Bisgaard audiogram types using machine learning. Three classes of machine learning approaches - unsupervised, supervised, and explainable - were evaluated. Principal component analysis (PCA) was performed to extract the first two principal components, which together explained more than 50 percent of the variance. Seven supervised multi-class classifiers were trained and compared, alongside unsupervised and explainable methods. Model development and evaluation used a large auditory reference database containing ACALOS data (N = 847). The PCA factor map showed substantial overlap between listeners, indicating that cleanly separating participants into six Bisgaard classes based solely on their loudness patterns is challenging. Nevertheless, the models demonstrated reasonable classification performance, with logistic regression achieving the highest accuracy among supervised approaches. These findings demonstrate that machine learning models can predict standard Bisgaard audiogram types, within certain limits, from calibration-independent loudness perception data, supporting potential applications in remote or resource-limited settings without requiring a traditional audiogram.
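The PCA step in the abstract can be sketched as follows: z-score the features, take a singular value decomposition, and read the explained-variance share of each component from the squared singular values. This is a sketch on synthetic data, not the paper's analysis; the six-feature layout, the two-factor generating model, and the noise level are assumptions, and only the sample size of 847 is taken from the source.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via SVD on z-scored features; returns scores and variance shares."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = S**2 / np.sum(S**2)      # share of variance per component
    scores = Xc @ Vt[:n_components].T    # coordinates on the factor map
    return scores, explained

# Synthetic stand-in: 847 rows driven by two latent factors plus noise.
rng = np.random.default_rng(42)
n_listeners, n_features = 847, 6
latent = rng.normal(size=(n_listeners, 2))
mixing = rng.normal(size=(2, n_features))
X = latent @ mixing + 0.8 * rng.normal(size=(n_listeners, n_features))

scores, explained = pca(X)
print(f"first two components explain {explained[:2].sum():.1%} of the variance")
```

Plotting `scores` coloured by class label gives the kind of factor map the abstract refers to. A high explained-variance share does not imply class separation, because PCA never sees the labels.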

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gets moderate classification accuracy on Bisgaard types from ACALOS loudness data using standard ML, but the acknowledged PCA overlap limits how reliable this can be for remote use.

The main thing to know is that this work feeds calibration-independent ACALOS loudness scaling data into off-the-shelf unsupervised, supervised, and explainable ML to classify listeners into the six Bisgaard audiogram types. On a reference database of 847 cases they report reasonable performance with logistic regression on top, while noting that the first two PCA components explain over 50 percent of variance but show substantial class overlap.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates using calibration-independent adaptive categorical loudness scaling (ACALOS) data to classify listeners into one of six standard Bisgaard audiogram types. It applies unsupervised PCA (first two components explain >50% variance but with substantial class overlap), trains and compares seven supervised multi-class classifiers on a reference database (N=847), and incorporates explainable ML methods. Logistic regression achieves the highest accuracy; the authors conclude that the approach can approximate Bisgaard types within certain limits, supporting remote or resource-limited audiometry without traditional calibrated audiograms.

Significance. If the reported classification performance holds after addressing overlap and validation details, the work could enable practical remote hearing assessment by linking loudness perception patterns directly to standard audiogram shapes without specialized equipment. The combination of unsupervised, supervised, and explainable techniques provides a transparent framework that may generalize to other perceptual data in audiology.

major comments (2)

[PCA Results] PCA factor map (Results section): the first two principal components explain more than 50% of variance yet exhibit substantial overlap across the six Bisgaard classes. This overlap indicates that the dominant directions of variation in the ACALOS loudness data do not separate the target classes cleanly, raising the possibility that any supervised accuracy arises from residual dimensions, imbalance, or chance rather than robust, generalizable structure needed for the central claim.
[Supervised Classification Results] Supervised classification evaluation (Results section): logistic regression is identified as highest-accuracy, but the manuscript provides insufficient detail on cross-validation strategy, stratification for class imbalance, per-class metrics, or statistical comparison against chance-level or majority-class baselines. Given the acknowledged overlap, these omissions prevent assessment of whether performance is meaningful or load-bearing for the remote-audiogram approximation claim.

minor comments (2)

[Methods] Clarify the precise input feature representation (e.g., number of loudness scaling points or derived statistics) fed to the classifiers and any preprocessing steps applied to the N=847 database.
[Explainable Methods] Expand the explainable ML analysis with concrete examples or visualizations showing how specific loudness levels contribute to individual class predictions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have prompted us to clarify and strengthen several aspects of our analysis. We address each major comment in turn.

Referee: [PCA Results] PCA factor map (Results section): the first two principal components explain more than 50% of variance yet exhibit substantial overlap across the six Bisgaard classes. This overlap indicates that the dominant directions of variation in the ACALOS loudness data do not separate the target classes cleanly, raising the possibility that any supervised accuracy arises from residual dimensions, imbalance, or chance rather than robust, generalizable structure needed for the central claim.

Authors: The manuscript already notes the substantial overlap in the PCA factor map and the resulting challenge in cleanly separating the six Bisgaard classes. The supervised models, however, operate on the complete set of ACALOS features rather than being restricted to the first two principal components. To address concerns about whether performance exceeds what could be expected from imbalance or chance, we have added comparisons to majority-class and chance-level baselines in the revised manuscript. We also provide additional discussion on the contribution of higher principal components to classification.
Revision: partial

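The baselines mentioned in this response can be computed directly from the class counts. The sketch below shows three reference accuracies: uniform chance, always predicting the majority class, and guessing in proportion to class frequency. The class counts are hypothetical placeholders; the source does not report the actual distribution of Bisgaard classes in the database.

```python
from collections import Counter

def baselines(labels):
    """Reference accuracies that a classifier should beat to be meaningful."""
    counts = Counter(labels)
    n = len(labels)
    uniform_chance = 1.0 / len(counts)                          # pick a class at random
    majority = max(counts.values()) / n                         # always pick the largest class
    proportional = sum((c / n) ** 2 for c in counts.values())   # guess by class frequency
    return uniform_chance, majority, proportional

# HYPOTHETICAL class counts for six classes (not taken from the paper).
hypothetical_counts = {
    "class_a": 210, "class_b": 150, "class_c": 100,
    "class_d": 70, "class_e": 45, "class_f": 25,
}
labels = [name for name, c in hypothetical_counts.items() for _ in range(c)]

uniform_chance, majority, proportional = baselines(labels)
print(f"uniform chance:       {uniform_chance:.3f}")
print(f"majority class:       {majority:.3f}")
print(f"proportional guesser: {proportional:.3f}")
```

With imbalanced classes the majority-class baseline sits well above uniform chance, which is why an overall accuracy figure is hard to interpret without it.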
Referee: [Supervised Classification Results] Supervised classification evaluation (Results section): logistic regression is identified as highest-accuracy, but the manuscript provides insufficient detail on cross-validation strategy, stratification for class imbalance, per-class metrics, or statistical comparison against chance-level or majority-class baselines. Given the acknowledged overlap, these omissions prevent assessment of whether performance is meaningful or load-bearing for the remote-audiogram approximation claim.

Authors: We agree that more methodological detail is needed for a full evaluation of the results. In the revised manuscript, we have expanded the description of the cross-validation procedure to specify the use of stratified k-fold cross-validation to account for class imbalance. We now report per-class metrics (precision, recall, F1-score) in addition to overall accuracy. Furthermore, we include statistical comparisons of model performance against chance-level and majority-class baselines, using McNemar's test or similar appropriate methods. These revisions should enable readers to assess the robustness of the findings despite the class overlap.
Revision: yes
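The three evaluation tools named in this response can be sketched in a few lines each. The code below is illustrative only and assumes nothing about the authors' implementation: `stratified_folds` assigns fold indices so each class is spread evenly across folds, `per_class_metrics` computes precision, recall, and F1 per class, and `mcnemar_exact` is the two-sided exact (binomial) form of McNemar's test on the discordant predictions of two classifiers. All function names and the toy labels are hypothetical.

```python
import math
import numpy as np

def stratified_folds(y, k, rng):
    """Return a fold index (0..k-1) per sample, balanced within each class."""
    folds = np.empty(len(y), dtype=int)
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        folds[idx] = np.arange(len(idx)) % k
    return folds

def per_class_metrics(y_true, y_pred, classes):
    out = {}
    for c in classes:
        tp = int(np.sum((y_pred == c) & (y_true == c)))
        fp = int(np.sum((y_pred == c) & (y_true != c)))
        fn = int(np.sum((y_pred != c) & (y_true == c)))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        out[int(c)] = (precision, recall, f1)
    return out

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on the discordant pairs of two classifiers."""
    b = int(np.sum((pred_a == y_true) & (pred_b != y_true)))  # only A correct
    c = int(np.sum((pred_a != y_true) & (pred_b == y_true)))  # only B correct
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)

# Toy demonstration with made-up labels.
rng = np.random.default_rng(3)
y_imbalanced = np.repeat([0, 1, 2], [50, 30, 12])
folds = stratified_folds(y_imbalanced, k=5, rng=rng)

y_true = np.array([0, 0, 1, 1, 2, 2])
pred_model = np.array([0, 0, 1, 2, 2, 2])
pred_majority = np.zeros_like(y_true)  # baseline that always predicts class 0
metrics = per_class_metrics(y_true, pred_model, classes=[0, 1, 2])
p_value = mcnemar_exact(y_true, pred_model, pred_majority)
print(metrics)
print(f"McNemar exact p-value vs. majority baseline: {p_value:.3f}")
```

In practice the two prediction vectors passed to `mcnemar_exact` would be the pooled out-of-fold predictions from the same cross-validation splits, so that both classifiers are compared on identical test samples.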

Circularity Check

0 steps flagged

No circularity: standard empirical ML classification on external reference database

The paper performs PCA for dimensionality reduction on ACALOS loudness scaling data, trains seven standard supervised multi-class classifiers (including logistic regression), and applies unsupervised/explainable methods to predict pre-existing Bisgaard audiogram classes from a large external reference database (N=847). Model evaluation uses held-out or cross-validated performance metrics on this independent data rather than any self-referential fit; no equations or steps reduce a claimed prediction to its own inputs by construction, no uniqueness theorems are imported via self-citation, and no ansatz or renaming of known results is presented as a derivation. The work is self-contained empirical modeling whose central claim rests on observable classification accuracy, not tautological re-expression of fitted parameters.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical correlation between ACALOS patterns and Bisgaard classes in the reference database; no explicit free parameters or invented entities are introduced beyond standard ML assumptions.

axioms (1)

Domain assumption: ACALOS data from the reference database is representative of the target population for remote assessment.
Invoked when generalizing classification performance to real-world remote settings.

pith-pipeline@v0.9.0 · 5519 in / 1058 out tokens · 48700 ms · 2026-05-17T01:54:22.138570+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

The PCA factor map showed substantial overlap between listeners, indicating that cleanly separating participants into six Bisgaard classes based solely on their loudness patterns is challenging. Nevertheless, the models demonstrated reasonable classification performance, with logistic regression achieving the highest accuracy among supervised approaches.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages · 1 internal anchor

[1]

RQ1: To what extent is it feasible to use uncalibrated ACALOS data for estimating the corresponding Bisgaard class?

work page
[2]

RQ2: How can the statistical dependence between ACALOS data and the audiogram be quantified, and can machine learning methods be employed to systematically demonstrate this dependence?

work page
[3]

RQ3: What are the strengths and limitations of the three machine learning methods examined in this study? Materials and methods Description of the data set The data set used in this study was provided by Hörzentrum Oldenburg gGmbH. It is a superset of the publicly available Oldenburg Hearing Health Repository (OHHR; Jafri et al., 2024) and comprises a com...

work page 2024
[4]

For this study, we extracted a subset of the larger database, focusing on the audiogram and loudness scaling data, only

have demonstrated that age is a significant predictor of hearing loss, we excluded demographic data from our analysis, as the focus of this study was to predict the standard audiograms based solely on loudness measures. For this study, we extracted a subset of the larger database, focusing on the audiogram and loudness scaling data, only. The audiogram da...

work page 2002
[5]

In contrast, the variables mhigh and mlow were left unchanged, as they represent relative quantities and are thus independent of calibration

and a standard deviation of 5 dB was applied to the features L2.5, L25, L50, and LCUT (please note that both ears of each participant had the same offset, as these parameters are likely influenced by device calibration). In contrast, the variables mhigh and mlow were left unchanged, as they represent relative quantities and are thus independent of calibra...

work page 2017
[6]

However, PCA projections were insufficient for classifying six groups, underscoring the complexity of the data and the limits of unsupervised clustering for this task

Unsupervised learning (PCA) offered insights into the structure of the feature space, highlighting which features drove variability and revealing clusters of feature types. However, PCA projections were insufficient for classifying six groups, underscoring the complexity of the data and the limits of unsupervised clustering for this task

work page
[7]

Supervised classifiers directly addressed the prediction of audiograms. Logistic Regression emerged as the most effective model, consistent with literature showing that linear models often outperform complex ones in small-to-moderate datasets (<20k records; Kass, 2019). In contrast, KNN struggled with the curse of dimensionality, confirming that not all m...

work page 2019
[8]

Calibration offset estimation in mobile hearing tests via categorical loudness scaling

Explainable ML methods bridged predictive modeling with audiological interpretation, identifying the most and least relevant features. This step is crucial for clinical translation, as it clarifies why models make certain predictions and builds trust among practitioners. Overall, each ML paradigm contributes differently: unsupervised methods aid feature i...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.5281/zenodo.14808650 2023
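The excerpt quoted under entry [5] describes a simulated calibration error: an offset with a standard deviation of 5 dB is added to the level features L2.5, L25, L50, and LCUT, shared by both ears of a participant, while the slopes mhigh and mlow stay unchanged. The sketch below illustrates that idea under stated assumptions. The excerpt is truncated, so the Gaussian distribution and zero mean are assumptions, as are the (participants × ears) array layout, the placeholder value ranges, and the function name `apply_calibration_offset`.

```python
import numpy as np

LEVEL_FEATURES = ["L2.5", "L25", "L50", "LCUT"]  # shifted by the offset
SLOPE_FEATURES = ["mhigh", "mlow"]               # relative quantities, left unchanged

def apply_calibration_offset(data, rng, sd_db=5.0, mean_db=0.0):
    """Add one random offset per participant to all level features.

    data maps feature name -> array of shape (n_participants, n_ears).
    mean_db = 0 and the Gaussian shape are ASSUMPTIONS; only sd_db = 5
    comes from the quoted excerpt.
    """
    n_participants = next(iter(data.values())).shape[0]
    # Shape (n, 1) broadcasts across ears, so both ears get the same offset.
    offset = rng.normal(mean_db, sd_db, size=(n_participants, 1))
    shifted = {
        name: values + offset if name in LEVEL_FEATURES else values.copy()
        for name, values in data.items()
    }
    return shifted, offset

# Placeholder values only; not real ACALOS measurements.
rng = np.random.default_rng(7)
n_participants = 5
data = {name: rng.uniform(20.0, 100.0, size=(n_participants, 2)) for name in LEVEL_FEATURES}
data.update({name: rng.uniform(0.2, 1.5, size=(n_participants, 2)) for name in SLOPE_FEATURES})

shifted, offset = apply_calibration_offset(data, rng)
print("per-participant offsets (dB):", np.round(offset.ravel(), 1))
```

Because every level feature of a participant moves by the same amount, differences such as L50 minus L25 are preserved, which is one way to see why relative quantities are insensitive to this kind of calibration error.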
