pith. machine review for the scientific record.

arXiv: 2605.03039 · v1 · submitted 2026-05-04 · 💻 cs.LG · cs.AI · cs.HC · cs.SD

Recognition: 3 Lean theorem links

Mixed-Precision Information Bottlenecks for On-Device Trait-State Disentanglement in Bipolar Agitation Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:06 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.HC · cs.SD
keywords mixed-precision quantization · information bottleneck · voice biomarkers · bipolar disorder · trait-state disentanglement · on-device inference · agitation detection

The pith

Mixed-precision quantization creates an information bottleneck that separates speaker identity from agitation states in voice without adversarial training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MP-IB, which uses numerical precision limits to disentangle stable speaker traits from volatile agitation states in audio for bipolar monitoring. Higher-precision FP16 processing is assigned to a trait head and lower-precision INT4 processing to a state head, producing an 8x capacity difference that drives separation through bit-width constraints alone. The design adds Dynamic Precision Scheduling and Multi-Scale Temporal Fusion to support this on edge hardware, yielding measurable correlation with agitation scores and strong suppression of identity information. The result matters because it points to a lightweight route to privacy-preserving, real-time clinical voice analysis on inexpensive devices.

Core claim

MP-IB treats mixed-precision quantization as an information bottleneck in which an FP16 trait head with 1,024 bits encodes speaker identity while an INT4 state head with 128 bits captures agitation, creating an 8x information asymmetry that achieves trait-state separation without adversarial losses. On the Bridge2AI-Voice dataset the approach reaches a correlation of rho = 0.117 and outperforms larger models and other disentanglement baselines, while also delivering zero-shot transfer to CREMA-D and near-random identity leakage metrics.
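
The information-theoretic step underneath this claim can be made explicit. As a hedged reconstruction (not the paper's own derivation): a quantized head that emits d values at b bits each can realize at most 2^(d·b) distinct codes, so its output entropy, and hence any mutual information passing through it, is capped at d·b bits.

```latex
% Capacity bound for a quantized head emitting d values at b bits each.
% This is a reconstruction of the framing, not the paper's derivation.
I(Z_{\text{state}}; X) \;\le\; H(Z_{\text{state}}) \;\le\; d \cdot b \;=\; C_{\text{state}} \ \text{bits}.
```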

What carries the argument

Mixed-precision information bottleneck implemented through an FP16 trait head and INT4 state head, together with Dynamic Precision Scheduling and Multi-Scale Temporal Fusion, that limits capacity to enforce separation.
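
A minimal sketch of how such a two-head design could be wired up, in PyTorch style. The encoder choice, head dimensions (64 and 32), and the fake INT4 quantizer with a straight-through gradient are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the MP-IB two-head idea: a shared encoder feeds an
# FP16 trait head and a fake-quantized INT4 state head. All names and
# dimensions are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class FakeInt4(torch.autograd.Function):
    """Symmetric 4-bit fake quantization with a straight-through gradient."""
    @staticmethod
    def forward(ctx, x):
        scale = x.abs().max().clamp(min=1e-8) / 7.0   # map max magnitude into INT4 range [-8, 7]
        return torch.clamp(torch.round(x / scale), -8, 7) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out                                # straight-through estimator

class MPIBSketch(nn.Module):
    def __init__(self, feat_dim=80, trait_dim=64, state_dim=32):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, 128, batch_first=True)
        self.trait_head = nn.Linear(128, trait_dim)    # high-capacity path
        self.state_head = nn.Linear(128, state_dim)    # capacity-limited path

    def forward(self, x):                              # x: (batch, time, feat_dim)
        h, _ = self.encoder(x)
        h = h.mean(dim=1)                              # crude temporal pooling
        z_trait = self.trait_head(h).half().float()    # FP16 storage round-trip
        z_state = FakeInt4.apply(self.state_head(h))   # 4-bit bottleneck
        return z_trait, z_state
```

The precision round-trips here only mimic storage limits at the head outputs; the paper's Dynamic Precision Scheduling and Multi-Scale Temporal Fusion are omitted from the sketch.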

If this is right

  • The model runs at 23.4 ms latency with a 617 KB footprint, enabling real-time use on devices costing under 20 dollars.
  • Identity leakage drops to near-random levels with an EER of 0.42 and MIA-AUC of 0.52.
  • Zero-shot performance on CREMA-D reaches an AUC of 0.817.
  • Correlation gains of 2.8 to 15.9 points absolute are obtained over a 94M-parameter WavLM-Adapter, a beta VAE, and hand-crafted prosody features.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same capacity-asymmetry principle could be tested for disentangling stable versus transient signals in other sensor modalities such as accelerometer or ECG data.
  • If the precision schedule generalizes, training pipelines for disentanglement tasks could become simpler by removing the need for adversarial objectives.
  • Deployment on wearables might extend beyond bipolar monitoring to other longitudinal health states where identity must stay hidden.

Load-bearing premise

That the chosen bit widths, absent any adversarial losses, suffice by themselves to produce the observed trait-state separation through capacity limits.

What would settle it

A controlled ablation in which the FP16 and INT4 heads receive identical bit widths yet still match the reported correlation and leakage results would show that the precision asymmetry is not the operative mechanism; conversely, a collapse of the separation under matched bit widths would confirm that it is.

Figures

Figures reproduced from arXiv: 2605.03039 by Joydeep Chandra.

Figure 1. System architecture of MP-IB.
Figure 2. t-SNE visualization of MP-IB representations. The trait head (a) forms distinct clusters enabling speaker recognition, while the state head (b) exhibits high entropy for identity (H ≈ log2(120)), confirming successful disentanglement via the precision-based bottleneck.
Figure 3. State embeddings colored by agitation score. The gradient from low (blue) to high (red) confirms that while identity is suppressed, the clinical signal is preserved and properly distributed in latent space (n = 822).
Figure 4. High-fidelity wavelet analysis (top) and attentional saliency (bottom) on a bipolar disorder sample.
Original abstract

Continuous monitoring of bipolar disorder agitation via voice biomarkers requires disentangling stable speaker traits from volatile affective states on resource-constrained edge devices. We introduce MP-IB, the first framework to treat mixed-precision quantization as an information bottleneck for clinical trait-state separation. The core insight is that numerical precision itself controls capacity: an FP16 trait head (1,024 bits) encodes speaker identity, while an INT4 state head (128 bits) captures agitation, yielding 8x information asymmetry without adversarial training. We augment this with Dynamic Precision Scheduling and Multi-Scale Temporal Fusion. On Bridge2AI-Voice (N=833, 4 sessions/participant, strict speaker-independent CV), MP-IB achieves rho = 0.117 (95% CI: [0.089, 0.145], p=0.003 vs. chance), outperforming 94M-parameter WavLM-Adapter with in-domain SSL continuation (rho = -0.042), beta VAE disentanglement (rho = 0.089), and hand-crafted prosody (rho = 0.031) by 2.8–15.9 points absolute. Zero-shot transfer to CREMA-D achieves AUC=0.817. Identity leakage is suppressed to near-random (EER=0.42, MIA-AUC=0.52). End-to-end latency is 23.4 ms with a 617 KB footprint, enabling real-time monitoring on sub 20 dollar devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MP-IB, a mixed-precision information bottleneck framework that uses quantization precision as a capacity constraint to disentangle speaker traits (via FP16 trait head, 1024 bits) from affective states (via INT4 state head, 128 bits) for on-device bipolar agitation detection from voice, without adversarial losses. Augmented by Dynamic Precision Scheduling and Multi-Scale Temporal Fusion, it reports rho=0.117 (95% CI [0.089,0.145]) on Bridge2AI-Voice (N=833, speaker-independent CV), outperforming WavLM-Adapter, beta-VAE, and prosody baselines, plus zero-shot AUC=0.817 on CREMA-D, low identity leakage (EER=0.42), and a 617KB/23.4ms footprint.

Significance. If the precision-asymmetry mechanism is shown to be causal, the work provides a lightweight alternative to adversarial disentanglement for clinical voice biomarkers, with direct implications for real-time edge deployment in mental health monitoring. The concrete metrics, CIs, and baseline comparisons are strengths, though the absence of controls leaves the core novelty unverified.

major comments (3)
  1. [Experimental results] Performance gains (rho=0.117 vs. baselines) are reported on held-out splits but without ablations on uniform-precision controls, bit-width sweeps, or removal of Dynamic Precision Scheduling/Multi-Scale Temporal Fusion. This is load-bearing for the central claim that the 8x asymmetry (FP16 vs. INT4) alone enforces trait-state separation via capacity limits.
  2. [Method] Method and results: no probing accuracies, mutual information estimates, or representation analyses are provided to confirm that the INT4 state head cannot encode speaker identity (as required by the information-bottleneck framing) while the FP16 trait head ignores agitation.
  3. [Abstract] Abstract and method: the stated capacities (FP16 trait head = 1024 bits, INT4 state head = 128 bits) are presented without explicit derivation from layer dimensions or parameter counts, making it impossible to verify that quantization strictly caps mutual information as claimed.
minor comments (2)
  1. The manuscript would benefit from an appendix with full training hyperparameters, optimizer settings, and exact model architectures to support reproducibility of the reported EER and AUC values.
  2. [Results] Figure captions and tables should explicitly state the number of runs or seeds used for the 95% CIs to clarify statistical robustness.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments identify key areas where additional evidence would strengthen the central claims about the mixed-precision information bottleneck. We agree that the manuscript would benefit from further controls and analyses, and we will incorporate revisions accordingly. Below we respond point-by-point to the major comments.

Point-by-point responses
  1. Referee: [Experimental results] Performance gains (rho=0.117 vs. baselines) are reported on held-out splits but without ablations on uniform-precision controls, bit-width sweeps, or removal of Dynamic Precision Scheduling/Multi-Scale Temporal Fusion. This is load-bearing for the central claim that the 8x asymmetry (FP16 vs. INT4) alone enforces trait-state separation via capacity limits.

    Authors: We acknowledge that isolating the contribution of the precision asymmetry requires explicit controls. The current experiments demonstrate consistent gains over strong baselines under speaker-independent evaluation, but we agree that uniform-precision ablations, bit-width sweeps, and component removals are necessary to substantiate the causal role of the 8x capacity constraint. In the revised manuscript we will add these ablations to the Experimental Results section, including (i) all-FP16 and all-INT4 variants, (ii) sweeps over state-head bit widths from 2 to 8 bits, and (iii) versions without Dynamic Precision Scheduling and without Multi-Scale Temporal Fusion. These additions will directly test whether the reported performance depends on the mixed-precision bottleneck; a sketch of such an ablation grid follows these responses. revision: yes

  2. Referee: [Method] Method and results: no probing accuracies, mutual information estimates, or representation analyses are provided to confirm that the INT4 state head cannot encode speaker identity (as required by the information-bottleneck framing) while the FP16 trait head ignores agitation.

    Authors: We agree that direct empirical verification of the information capacities would reinforce the information-bottleneck interpretation. Although the performance metrics and low identity leakage (EER=0.42) are consistent with the intended separation, we will add the requested analyses in the revision. Specifically, we will include linear probing accuracies for speaker identity on both heads, mutual-information estimates between representations and speaker labels (where computationally feasible), and qualitative representation visualizations (e.g., t-SNE) showing trait versus state clustering. These will be presented in a new subsection of the Results to confirm that the INT4 head is capacity-limited with respect to identity while the FP16 head retains trait information; a sketch of such a probe follows these responses. revision: yes

  3. Referee: [Abstract] Abstract and method: the stated capacities (FP16 trait head = 1024 bits, INT4 state head = 128 bits) are presented without explicit derivation from layer dimensions or parameter counts, making it impossible to verify that quantization strictly caps mutual information as claimed.

    Authors: We apologize for the omission of the explicit derivation. The stated bit capacities follow from the output dimensionality of each head multiplied by the effective bits per value after quantization (FP16 at 16 bits, INT4 at 4 bits), adjusted for the architectural dimensions of the respective heads. In the revised Method section we will provide the full calculation, including the precise layer dimensions, the formula used to obtain 1024 bits for the trait head and 128 bits for the state head, and a brief discussion of how quantization imposes an upper bound on mutual information under the information-bottleneck framework. This will make the 8x asymmetry verifiable from the architecture description; the arithmetic is sketched after these responses. revision: yes
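
On point 1, a minimal sketch of how the promised ablation grid might be organized. The keys, flags, and tags below are assumptions about how such a sweep could be expressed, not the authors' experiment code.

```python
# Illustrative ablation grid for the promised controls (point 1).
# Configuration schema is assumed; each config would be trained and
# scored on the same speaker-independent CV splits as the main model.
ablations = (
    [{"trait_bits": 16, "state_bits": 16, "tag": "all-FP16 control"},
     {"trait_bits": 4,  "state_bits": 4,  "tag": "all-INT4 control"}]
    + [{"trait_bits": 16, "state_bits": b, "tag": f"state-head sweep: {b}-bit"}
       for b in (2, 3, 4, 5, 6, 8)]
    + [{"trait_bits": 16, "state_bits": 4, "dps": False,
        "tag": "without Dynamic Precision Scheduling"},
       {"trait_bits": 16, "state_bits": 4, "mstf": False,
        "tag": "without Multi-Scale Temporal Fusion"}]
)

for cfg in ablations:
    print(cfg["tag"])
```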
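
On point 2, one way the promised identity probe could look. The embeddings and labels below are random stand-ins for quantities the revision would extract from the trained heads; with a real model, the probe would be run once per head and compared against chance.

```python
# Linear speaker-identity probe on state-head embeddings (point 2).
# z_state and speakers are random stand-ins: n = 822 as in Figure 3,
# 120 speakers as in Figure 2. Accuracy near 1/120 means no leakage.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
z_state = rng.normal(size=(822, 32))        # stand-in state embeddings
speakers = rng.integers(0, 120, size=822)   # stand-in speaker labels

z_tr, z_te, y_tr, y_te = train_test_split(z_state, speakers,
                                          test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(z_tr, y_tr)
print(f"identity probe accuracy: {probe.score(z_te, y_te):.3f}")  # ~1/120 if nothing leaks
```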
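
On point 3, the arithmetic is easy to reconstruct up to a choice of head dimensions. The dimensions below (64 and 32) are back-solved assumptions that reproduce the quoted totals; other (dimension, bit-width) pairs do too, which is exactly why the referee asks for the explicit derivation.

```latex
% Back-solved head dimensions (assumptions, not stated values):
% d_trait = 64, d_state = 32.
\begin{align*}
C_{\text{trait}} &= d_{\text{trait}} \cdot b_{\text{FP16}} = 64 \times 16 = 1{,}024 \ \text{bits},\\
C_{\text{state}} &= d_{\text{state}} \cdot b_{\text{INT4}} = 32 \times 4 = 128 \ \text{bits},\\
C_{\text{trait}} / C_{\text{state}} &= 1{,}024 / 128 = 8.
\end{align*}
```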

Circularity Check

0 steps flagged

No significant circularity; central claims rest on empirical evaluation on held-out data rather than self-referential definitions or fits.

Full rationale

The paper presents MP-IB as using mixed-precision quantization (FP16 trait head at 1024 bits vs. INT4 state head at 128 bits) to create an 8x information asymmetry for trait-state disentanglement without adversarial losses, augmented by Dynamic Precision Scheduling and Multi-Scale Temporal Fusion. This is validated via rho=0.117 (p=0.003) on speaker-independent CV splits of Bridge2AI-Voice (N=833) and zero-shot AUC=0.817 on CREMA-D, with identity leakage near chance (EER=0.42). No load-bearing step reduces the reported gains or the capacity-control insight to a fitted parameter renamed as prediction, a self-citation chain, or an equation that is true by construction. The performance numbers and comparisons to WavLM-Adapter, beta VAE, and prosody baselines are external and falsifiable on the stated splits; the assumption that quantization strictly caps mutual information is stated as an insight, not derived from the paper's own outputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that bit-width directly limits representational capacity in a controllable way for disentanglement, plus two hand-chosen bit widths treated as design parameters.

free parameters (2)
  • Trait head precision = FP16 (1024 bits)
    Chosen to provide high capacity for speaker identity
  • State head precision = INT4 (128 bits)
    Chosen to enforce low capacity on the head carrying agitation features
axioms (1)
  • domain assumption: Numerical precision directly controls the information capacity of separate model heads
    Core insight invoked to justify the 8x asymmetry without additional losses

pith-pipeline@v0.9.0 · 5576 in / 1394 out tokens · 42357 ms · 2026-05-08T19:06:02.891361+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
