Calibration offset estimation in mobile hearing tests via categorical loudness scaling

Birger Kollmeier; Chen Xu

arxiv: 2508.14824 · v1 · submitted 2025-08-20 · ⚛️ physics.med-ph

Pith reviewed 2026-05-18 21:56 UTC · model grok-4.3

classification ⚛️ physics.med-ph

keywords categorical loudness scaling · calibration offset estimation · mobile hearing tests · smartphone assessments · Bayesian regression · hearing healthcare · dynamic range · OHHR dataset

The pith

Categorical loudness scaling estimates calibration offsets in mobile hearing tests with correlations up to 0.81.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops prediction models that use categorical loudness scaling results to estimate and correct device calibration offsets for smartphone hearing assessments. It simulates offsets from a Gaussian distribution and trains a Bayesian regression model plus a nearest neighbor model on data from 847 older adults drawn from the Oldenburg Hearing Health Repository. The models exploit level-independent CLS measures such as dynamic range, which stay stable even when device output levels are wrong. These steps matter because they remove a major barrier to accurate hearing tests performed outside controlled lab settings and on users' own phones.
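
The level-independence argument can be illustrated with a minimal sketch. It assumes, purely for illustration, that dynamic range is the difference between an uncomfortable-loudness level and a threshold level, and that an uncalibrated device shifts every level by the same additive offset. The function name, the sign convention, and all numbers are hypothetical and not taken from the paper.

```python
def dynamic_range(threshold_db: float, uncomfortable_db: float) -> float:
    """Dynamic range in dB: span between threshold and uncomfortable level."""
    return uncomfortable_db - threshold_db


# Hypothetical listener, levels in dB as a calibrated device would report them.
true_threshold_db = 35.0
true_uncomfortable_db = 100.0

# Hypothetical unknown calibration offset of the phone, in dB (assumed additive).
offset_db = 12.5

# On the uncalibrated device both levels are shifted by the same offset.
observed_threshold_db = true_threshold_db + offset_db
observed_uncomfortable_db = true_uncomfortable_db + offset_db

dr_true = dynamic_range(true_threshold_db, true_uncomfortable_db)
dr_observed = dynamic_range(observed_threshold_db, observed_uncomfortable_db)

# The offset cancels in the difference, so both values agree.
print(dr_true, dr_observed)
```

Under these assumptions the threshold moves with the offset while the dynamic range does not, which is the property the models are said to exploit.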

Core claim

The paper establishes that CLS-based models can compensate for missing calibration by predicting device offsets from loudness scaling parameters. The Bayesian regression model reaches correlations of up to 0.81 between estimated and true offsets, while both models reduce calibration uncertainty by factors between 0.41 and 0.79 relative to threshold-based methods. This holds because CLS supplies measures that remain robust to arbitrary level shifts, allowing individual-level correction from the OHHR dataset.

What carries the argument

Bayesian regression and nearest neighbor models trained on level-independent CLS parameters such as dynamic range drawn from the OHHR dataset.

Load-bearing premise

The assumption that simulated Gaussian offsets and CLS parameters from the OHHR dataset will generalize to real uncontrolled mobile environments with arbitrary device offsets.

What would settle it

Collect CLS responses and actual measured calibration offsets from users on their own smartphones in everyday settings, then test whether the model predictions match the measured offsets within the reported uncertainty range.

Original abstract

Objective: To enable reliable smartphone-based hearing assessments by developing methods to estimate device calibration offsets using categorical loudness scaling (CLS). Design: Calibration offsets were simulated from a Gaussian distribution. Two prediction models - a Bayesian regression model and a nearest neighbor model - were trained on CLS-derived parameters and data from the Oldenburg Hearing Health Repository (OHHR). CLS was chosen because it provides level-independent measures (e.g., dynamic range) that remain robust despite calibration errors. Study Sample: The dataset comprised CLS results from N = 847 participants with a mean age of 70.0 years (SD = 8.7), including 556 male and 291 female listeners with diverse hearing profiles. Results: The Bayesian regression model achieved correlations of up to 0.81 between estimated and true calibration offsets, enabling accurate individual-level correction. Compared to threshold-based approaches, calibration uncertainty was reduced by factors between 0.41 and 0.79, demonstrating greater robustness in uncontrolled environments. Conclusions: CLS-based models can effectively compensate for missing calibration in mobile hearing assessments. This approach provides a practical alternative to threshold-based methods, supporting the use of smartphone-based tests outside laboratory settings and expanding access to reliable hearing healthcare in everyday and resource-limited contexts.
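
The design described in the abstract can be mimicked with a toy simulation. The sketch below is not the authors' model: it uses ordinary least squares in place of Bayesian regression, fully synthetic listeners instead of OHHR data, an invented population link between threshold and dynamic range, and an assumed offset standard deviation of 10 dB (the Gaussian parameters are not given in the material above). All function and variable names are hypothetical.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible


def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def synthetic_listener():
    """Invented link: higher thresholds go with a narrower dynamic range."""
    threshold_db = rng.uniform(0.0, 70.0)
    dynamic_range_db = 100.0 - 0.8 * threshold_db + rng.gauss(0.0, 5.0)
    return threshold_db, dynamic_range_db


# Calibrated reference data: learn the threshold as a function of dynamic range.
reference = [synthetic_listener() for _ in range(500)]
intercept, slope = fit_line([dr for _, dr in reference],
                            [t for t, _ in reference])

# Uncalibrated listeners: the threshold is shifted by an unknown Gaussian offset.
OFFSET_SD_DB = 10.0  # assumed value, not from the paper
true_offsets, estimated_offsets = [], []
for _ in range(300):
    threshold_db, dynamic_range_db = synthetic_listener()
    offset_db = rng.gauss(0.0, OFFSET_SD_DB)
    observed_threshold_db = threshold_db + offset_db  # dynamic range unchanged
    expected_threshold_db = intercept + slope * dynamic_range_db
    true_offsets.append(offset_db)
    estimated_offsets.append(observed_threshold_db - expected_threshold_db)

r = pearson(true_offsets, estimated_offsets)
rmse = math.sqrt(sum((e - t) ** 2 for e, t in zip(estimated_offsets, true_offsets))
                 / len(true_offsets))
print(f"r = {r:.2f}, RMSE = {rmse:.1f} dB, offset SD = {OFFSET_SD_DB} dB")
```

In this toy setting the offset estimate is the gap between the observed level and the level predicted from the shift-invariant feature, so its accuracy is limited by how tightly that feature predicts the level in the reference population.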

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops and evaluates two models (Bayesian regression and nearest-neighbor) to estimate unknown calibration offsets in smartphone-based hearing tests. Offsets are simulated from a Gaussian distribution and added to categorical loudness scaling (CLS) data from the Oldenburg Hearing Health Repository (N=847 participants). Level-independent CLS parameters such as dynamic range are used as predictors; the models achieve correlations up to 0.81 with the simulated ground-truth offsets and reduce uncertainty by factors of 0.41–0.79 relative to threshold-based methods.

Significance. If the reported performance generalizes beyond the simulated setting, the work would provide a practical route to calibration-free mobile hearing assessments, potentially increasing accessibility in non-laboratory environments. The use of CLS parameters that are designed to be robust to level shifts is a conceptually attractive choice, and the quantitative comparison against threshold methods is a clear strength.

major comments (2)

[Results] Results and Methods sections: All performance figures (r ≤ 0.81, uncertainty reductions 0.41–0.79) are obtained exclusively on data with artificially added Gaussian offsets. No experiments using measured calibration errors from actual smartphones or uncontrolled listening conditions are presented, leaving the central claim of applicability to real mobile environments unsupported by direct evidence.
[Methods] Methods: The manuscript provides no information on cross-validation strategy, train/test splits, or regularization choices for the Bayesian regression and nearest-neighbor models. Without these details it is impossible to assess whether the reported correlations reflect genuine predictive power or overfitting to the particular simulation.

minor comments (2)

[Abstract] Abstract: The phrase 'calibration uncertainty was reduced by factors between 0.41 and 0.79' is ambiguous; clarify whether these are multiplicative factors on the standard deviation or on the variance.
The manuscript would benefit from an explicit statement of the assumed distribution parameters for the simulated offsets and from a sensitivity analysis showing how results change when the Gaussian assumption is relaxed.
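
The ambiguity raised in the first minor comment can be made concrete with a short calculation. The sketch below only converts the two reported factors between the two readings (factor applied to the standard deviation versus factor applied to the variance); it makes no claim about which reading the authors intend, and the variable names are hypothetical.

```python
import math

reported_factors = (0.41, 0.79)  # as quoted from the abstract

# For each factor: (variance ratio if the factor scales the SD,
#                   SD ratio if the factor scales the variance)
readings = {f: (f ** 2, math.sqrt(f)) for f in reported_factors}

for factor, (variance_ratio, sd_ratio) in readings.items():
    print(f"factor {factor}: "
          f"if on SD, variance ratio = {variance_ratio:.4f}; "
          f"if on variance, SD ratio = {sd_ratio:.3f}")
```

A factor of 0.41 on the standard deviation corresponds to a variance ratio of about 0.17, whereas a factor of 0.41 on the variance corresponds to a standard-deviation ratio of about 0.64, so the two readings describe noticeably different improvements.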

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make.

Point-by-point responses

Referee: [Results] Results and Methods sections: All performance figures (r ≤ 0.81, uncertainty reductions 0.41–0.79) are obtained exclusively on data with artificially added Gaussian offsets. No experiments using measured calibration errors from actual smartphones or uncontrolled listening conditions are presented, leaving the central claim of applicability to real mobile environments unsupported by direct evidence.

Authors: We agree that the reported results are derived from simulated Gaussian calibration offsets added to the CLS data. This controlled simulation was selected to establish known ground truth and to isolate the contribution of level-independent CLS features such as dynamic range. Because these features are constructed to be invariant to uniform level shifts, the simulation provides a direct test of the method’s core mechanism. We acknowledge, however, that direct evidence from measured smartphone offsets or fully uncontrolled conditions is not presented. The revised manuscript will add a dedicated limitations paragraph that explicitly discusses the simulation assumptions, their relation to real-device errors, and the need for subsequent field validation studies.
revision: partial

Referee: [Methods] Methods: The manuscript provides no information on cross-validation strategy, train/test splits, or regularization choices for the Bayesian regression and nearest-neighbor models. Without these details it is impossible to assess whether the reported correlations reflect genuine predictive power or overfitting to the particular simulation.

Authors: We appreciate the referee highlighting this gap. The revised Methods section will specify that participant-level data were randomly partitioned into an 80% training set and a 20% test set, with 5-fold cross-validation performed within the training set to select hyperparameters. For the Bayesian regression model, weakly informative normal priors were placed on the coefficients and a half-Cauchy prior on the residual scale; no further explicit regularization was applied. For the nearest-neighbor model, Euclidean distance was used with k = 5, where k was chosen by inner cross-validation. These details will be added so that readers can evaluate the risk of overfitting.
revision: yes
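
The protocol named in this simulated response (80/20 participant-level split, 5-fold inner cross-validation, Euclidean nearest-neighbor regression) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the data are synthetic, each row stands for one participant, the candidate grid for k is invented, and the helper names `knn_predict` and `cv_error` are hypothetical.

```python
import math
import random


def knn_predict(train, query, k):
    """Mean target of the k training rows closest (Euclidean) to the query."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    return sum(target for _, target in nearest) / k


def cv_error(rows, k, n_folds=5):
    """Mean squared error of k-NN, averaged over n_folds held-out folds."""
    folds = [rows[i::n_folds] for i in range(n_folds)]
    sq_errors = []
    for i, held_out in enumerate(folds):
        fit_rows = [r for j, f in enumerate(folds) if j != i for r in f]
        sq_errors += [(knn_predict(fit_rows, x, k) - y) ** 2 for x, y in held_out]
    return sum(sq_errors) / len(sq_errors)


rng = random.Random(0)  # fixed seed for reproducibility

# Synthetic participants: two features and one noisy target (all invented).
rows = []
for _ in range(200):
    x = (rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0))
    y = 10.0 * x[0] - 5.0 * x[1] + rng.gauss(0.0, 1.0)
    rows.append((x, y))

# Outer split: 80% of participants for training, 20% held out for testing.
rng.shuffle(rows)
split = int(0.8 * len(rows))
train_rows, test_rows = rows[:split], rows[split:]

# Inner 5-fold cross-validation on the training set only, to choose k.
candidate_ks = [1, 3, 5, 7, 9]  # assumed grid
best_k = min(candidate_ks, key=lambda k: cv_error(train_rows, k))

# Final check on the untouched test set.
test_mse = sum((knn_predict(train_rows, x, best_k) - y) ** 2
               for x, y in test_rows) / len(test_rows)
print(f"best k = {best_k}, test MSE = {test_mse:.2f}")
```

Keeping the test participants out of the inner cross-validation is what lets the final error serve as an estimate of performance on unseen listeners.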

Circularity Check

0 steps flagged

No circularity: simulation-based prediction of external offsets from independent CLS parameters

full rationale

The paper simulates Gaussian calibration offsets, adds them to the existing OHHR CLS dataset (N=847), extracts level-independent parameters such as dynamic range, and trains separate Bayesian regression and nearest-neighbor models whose target is the simulated offset value. Reported correlations (up to 0.81) and uncertainty-reduction factors are computed between these model predictions and the independently generated simulated offsets; no equation equates the output to a fitted input by construction, no self-citation supplies a uniqueness theorem, and the central evaluation remains a standard supervised simulation study whose inputs and targets are distinct.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that CLS yields level-independent measures robust to calibration error and on the representativeness of simulated Gaussian offsets and the OHHR training set.

free parameters (1)

Gaussian parameters for simulated offsets
Offsets drawn from an unspecified Gaussian distribution to generate training targets.

axioms (1)

domain assumption: CLS provides level-independent measures (e.g., dynamic range) that remain robust despite calibration errors.
Explicitly stated in the design paragraph of the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Calibration offsets were simulated from a Gaussian distribution. Two prediction models - a Bayesian regression model and a nearest neighbor model - were trained on CLS-derived parameters... level-independent measures (e.g., dynamic range)"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "The dataset comprised CLS results from N = 847 participants... ACALOS procedure"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Standard audiogram classification from loudness scaling data using unsupervised, supervised, and explainable machine learning techniques
cs.SD · 2025-12 · unverdicted · novelty 4.0

Machine learning models can predict standard Bisgaard audiogram types from calibration-independent ACALOS loudness data with reasonable accuracy despite substantial class overlap.