Standard audiogram classification from loudness scaling data using unsupervised, supervised, and explainable machine learning techniques
Pith reviewed 2026-05-17 01:54 UTC · model grok-4.3
The pith
Machine learning models can predict standard Bisgaard audiogram types from calibration-independent loudness scaling data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Supervised machine learning classifiers trained on ACALOS loudness scaling data can assign listeners to Bisgaard audiogram types with reasonable accuracy, although principal component analysis reveals substantial overlap between the classes in the data space.
What carries the argument
Supervised multi-class classifiers, with logistic regression performing best, applied to ACALOS loudness scaling features to assign the six Bisgaard audiogram classes.
If this is right
- Calibration-free loudness data can approximate standard hearing loss classifications in remote settings.
- Logistic regression offers a practical route for such predictions among tested classifiers.
- Explainable methods can highlight which loudness features most influence the class assignments.
- Unsupervised PCA helps visualize the data structure but does not cleanly separate the six classes.
Where Pith is reading between the lines
- Smartphone apps running simple loudness scaling tests could provide preliminary hearing profiles without clinic equipment.
- Adding other quick non-calibrated measures might improve separation where overlap currently limits performance.
- Testing the same models on new listener groups outside the original database would check how well the mapping generalizes.
Load-bearing premise
The loudness scaling patterns contain enough distinguishing information to map reliably onto the six Bisgaard classes despite their overlap in the principal component map.
What would settle it
A new independent dataset of listeners with known Bisgaard types where classification accuracy falls well below usable levels, such as near or below 40 percent.
To address the calibration and procedural challenges inherent in remote audiogram assessment for rehabilitative audiology, this study investigated whether calibration-independent adaptive categorical loudness scaling (ACALOS) data can be used to approximate individual audiograms by classifying listeners into standard Bisgaard audiogram types using machine learning. Three classes of machine learning approaches - unsupervised, supervised, and explainable - were evaluated. Principal component analysis (PCA) was performed to extract the first two principal components, which together explained more than 50 percent of the variance. Seven supervised multi-class classifiers were trained and compared, alongside unsupervised and explainable methods. Model development and evaluation used a large auditory reference database containing ACALOS data (N = 847). The PCA factor map showed substantial overlap between listeners, indicating that cleanly separating participants into six Bisgaard classes based solely on their loudness patterns is challenging. Nevertheless, the models demonstrated reasonable classification performance, with logistic regression achieving the highest accuracy among supervised approaches. These findings demonstrate that machine learning models can predict standard Bisgaard audiogram types, within certain limits, from calibration-independent loudness perception data, supporting potential applications in remote or resource-limited settings without requiring a traditional audiogram.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates using calibration-independent adaptive categorical loudness scaling (ACALOS) data to classify listeners into one of six standard Bisgaard audiogram types. It applies unsupervised PCA (first two components explain >50% variance but with substantial class overlap), trains and compares seven supervised multi-class classifiers on a reference database (N=847), and incorporates explainable ML methods. Logistic regression achieves the highest accuracy; the authors conclude that the approach can approximate Bisgaard types within certain limits, supporting remote or resource-limited audiometry without traditional calibrated audiograms.
Significance. If the reported classification performance holds after addressing overlap and validation details, the work could enable practical remote hearing assessment by linking loudness perception patterns directly to standard audiogram shapes without specialized equipment. The combination of unsupervised, supervised, and explainable techniques provides a transparent framework that may generalize to other perceptual data in audiology.
major comments (2)
- [PCA Results] PCA factor map (Results section): the first two principal components explain more than 50% of variance yet exhibit substantial overlap across the six Bisgaard classes. This overlap indicates that the dominant directions of variation in the ACALOS loudness data do not separate the target classes cleanly, raising the possibility that any supervised accuracy arises from residual dimensions, imbalance, or chance rather than robust, generalizable structure needed for the central claim.
- [Supervised Classification Results] Supervised classification evaluation (Results section): logistic regression is identified as highest-accuracy, but the manuscript provides insufficient detail on cross-validation strategy, stratification for class imbalance, per-class metrics, or statistical comparison against chance-level or majority-class baselines. Given the acknowledged overlap, these omissions prevent assessment of whether performance is meaningful or load-bearing for the remote-audiogram approximation claim.
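To make the second major comment concrete, the following is a minimal sketch of the per-class metrics and the majority-class baseline the referee asks for. It is not the authors' evaluation code. The function names and the three-class toy labels are hypothetical and stand in for the six Bisgaard classes.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline_accuracy(y_true):
    """Accuracy of a trivial model that always predicts the most frequent class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label that occurs."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Hypothetical toy labels (an imbalanced three-class problem), not paper data.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2, 1]

print("accuracy:", accuracy(y_true, y_pred))
print("majority baseline:", majority_baseline_accuracy(y_true))
for label, (p, r, f1) in per_class_metrics(y_true, y_pred).items():
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

With imbalanced classes, overall accuracy only becomes informative once it is read against the majority-class baseline, and the per-class rows show whether the rarer classes are recognised at all.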
minor comments (2)
- [Methods] Clarify the precise input feature representation (e.g., number of loudness scaling points or derived statistics) fed to the classifiers and any preprocessing steps applied to the N=847 database.
- [Explainable Methods] Expand the explainable ML analysis with concrete examples or visualizations showing how specific loudness levels contribute to individual class predictions.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have prompted us to clarify and strengthen several aspects of our analysis. We address each major comment in turn.
- Referee: [PCA Results] PCA factor map (Results section): the first two principal components explain more than 50% of variance yet exhibit substantial overlap across the six Bisgaard classes. This overlap indicates that the dominant directions of variation in the ACALOS loudness data do not separate the target classes cleanly, raising the possibility that any supervised accuracy arises from residual dimensions, imbalance, or chance rather than robust, generalizable structure needed for the central claim.
Authors: The manuscript already notes the substantial overlap in the PCA factor map and the resulting challenge in cleanly separating the six Bisgaard classes. The supervised models, however, operate on the complete set of ACALOS features rather than being restricted to the first two principal components. To address concerns about whether performance exceeds what could be expected from imbalance or chance, we have added comparisons to majority-class and chance-level baselines in the revised manuscript. We also provide additional discussion on the contribution of higher principal components to classification.
revision: partial
- Referee: [Supervised Classification Results] Supervised classification evaluation (Results section): logistic regression is identified as highest-accuracy, but the manuscript provides insufficient detail on cross-validation strategy, stratification for class imbalance, per-class metrics, or statistical comparison against chance-level or majority-class baselines. Given the acknowledged overlap, these omissions prevent assessment of whether performance is meaningful or load-bearing for the remote-audiogram approximation claim.
Authors: We agree that more methodological detail is needed for a full evaluation of the results. In the revised manuscript, we have expanded the description of the cross-validation procedure to specify the use of stratified k-fold cross-validation to account for class imbalance. We now report per-class metrics (precision, recall, F1-score) in addition to overall accuracy. Furthermore, we include statistical comparisons of model performance against chance-level and majority-class baselines, using McNemar's test or similar appropriate methods. These revisions should enable readers to assess the robustness of the findings despite the class overlap.
revision: yes
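The two procedures named in this response, stratified k-fold splitting and McNemar's test, can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the helper names, the fold count, and the toy labels are hypothetical, and the exact (binomial) form of McNemar's test is one of several variants the response's "or similar appropriate methods" could refer to.

```python
import random
from collections import defaultdict
from math import comb


def stratified_folds(labels, k, seed=0):
    """Split sample indices into k folds with roughly equal class proportions.

    Indices of each class are shuffled and dealt round-robin over the folds.
    (Dealing always starts at fold 0, which is good enough for a sketch.)
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two models scored on the same samples.

    Only discordant samples count: b = A right and B wrong, c = A wrong and B right.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, m in triples if a == t and m != t)
    c = sum(1 for t, a, m in triples if a != t and m == t)
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy data: an imbalanced two-class label vector.
labels = [0] * 6 + [1] * 4
folds = stratified_folds(labels, k=2)
for fold in folds:
    print(sorted(labels[i] for i in fold))  # each fold: three 0s and two 1s

# Hypothetical paired predictions: model A always right, model B wrong six times.
y_true = [1] * 10
pred_a = [1] * 10
pred_b = [0] * 6 + [1] * 4
p_value = mcnemar_exact(y_true, pred_a, pred_b)
print("McNemar exact p-value:", p_value)
```

In the baseline comparison described above, `pred_b` would hold the majority-class or chance-level predictions for the same held-out listeners, so that the test is applied to paired outcomes.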
Circularity Check
No circularity: standard empirical ML classification on external reference database
The paper performs PCA for dimensionality reduction on ACALOS loudness scaling data, trains seven standard supervised multi-class classifiers (including logistic regression), and applies unsupervised/explainable methods to predict pre-existing Bisgaard audiogram classes from a large external reference database (N=847). Model evaluation uses held-out or cross-validated performance metrics on this independent data rather than any self-referential fit; no equations or steps reduce a claimed prediction to its own inputs by construction, no uniqueness theorems are imported via self-citation, and no ansatz or renaming of known results is presented as a derivation. The work is self-contained empirical modeling whose central claim rests on observable classification accuracy, not tautological re-expression of fitted parameters.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: ACALOS data from the reference database are representative of the target population for remote assessment.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem:
The PCA factor map showed substantial overlap between listeners, indicating that cleanly separating participants into six Bisgaard classes based solely on their loudness patterns is challenging. Nevertheless, the models demonstrated reasonable classification performance, with logistic regression achieving the highest accuracy among supervised approaches.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] RQ1: To what extent is it feasible to use uncalibrated ACALOS data for estimating the corresponding Bisgaard class?
- [2] RQ2: How can the statistical dependence between ACALOS data and the audiogram be quantified, and can machine learning methods be employed to systematically demonstrate this dependence?
- [3] RQ3: What are the strengths and limitations of the three machine learning methods examined in this study? Materials and methods. Description of the data set. The data set used in this study was provided by Hörzentrum Oldenburg gGmbH. It is a superset of the publicly available Oldenburg Hearing Health Repository (OHHR; Jafri et al., 2024) and comprises a com...
- [4] have demonstrated that age is a significant predictor of hearing loss, we excluded demographic data from our analysis, as the focus of this study was to predict the standard audiograms based solely on loudness measures. For this study, we extracted a subset of the larger database, focusing on the audiogram and loudness scaling data, only. The audiogram da...
- [5] and a standard deviation of 5 dB was applied to the features L2.5, L25, L50, and LCUT (please note that both ears of each participant had the same offset, as these parameters are likely influenced by device calibration). In contrast, the variables mhigh and mlow were left unchanged, as they represent relative quantities and are thus independent of calibra...
- [6] Unsupervised learning (PCA) offered insights into the structure of the feature space, highlighting which features drove variability and revealing clusters of feature types. However, PCA projections were insufficient for classifying six groups, underscoring the complexity of the data and the limits of unsupervised clustering for this task
- [7] Supervised classifiers directly addressed the prediction of audiograms. Logistic Regression emerged as the most effective model, consistent with literature showing that linear models often outperform complex ones in small-to-moderate datasets (<20k records; Kass, 2019). In contrast, KNN struggled with the curse of dimensionality, confirming that not all m...
- [8] Calibration offset estimation in mobile hearing tests via categorical loudness scaling
Explainable ML methods bridged predictive modeling with audiological interpretation, identifying the most and least relevant features. This step is crucial for clinical translation, as it clarifies why models make certain predictions and builds trust among practitioners. Overall, each ML paradigm contributes differently: unsupervised methods aid feature i...
doi:10.5281/zenodo.14808650 · 2023