pith. machine review for the scientific record.

arXiv: 2605.03039 · v1 · submitted 2026-05-04 · 💻 cs.LG · cs.AI · cs.HC · cs.SD

Recognition: 3 Lean theorem links

Mixed-Precision Information Bottlenecks for On-Device Trait-State Disentanglement in Bipolar Agitation Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:06 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.HC · cs.SD
keywords mixed-precision quantization · information bottleneck · voice biomarkers · bipolar disorder · trait-state disentanglement · on-device inference · agitation detection

The pith

Mixed-precision quantization creates an information bottleneck that separates speaker identity from agitation states in voice without adversarial training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MP-IB, which uses numerical precision limits to disentangle stable speaker traits from volatile agitation states in audio for bipolar monitoring. Higher-precision FP16 processing is assigned to a trait head and lower-precision INT4 processing to a state head, producing an 8x capacity difference that drives separation through bit-width constraints alone. The design adds Dynamic Precision Scheduling and Multi-Scale Temporal Fusion to support this on edge hardware, yielding measurable correlation with agitation scores and strong suppression of identity information. The result matters because it points to a lightweight route to privacy-preserving, real-time clinical voice analysis on inexpensive devices.

Core claim

MP-IB treats mixed-precision quantization as an information bottleneck in which an FP16 trait head with 1,024 bits encodes speaker identity while an INT4 state head with 128 bits captures agitation, creating an 8x information asymmetry that achieves trait-state separation without adversarial losses. On the Bridge2AI-Voice dataset the approach reaches a correlation of rho = 0.117 and outperforms larger models and other disentanglement baselines, while also delivering zero-shot transfer to CREMA-D and near-random identity leakage metrics.
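
The information-theoretic step underneath this claim can be made explicit. As a hedged reconstruction (not the paper's own derivation): a quantized head that emits d values at b bits each can realize at most 2^(d·b) distinct codes, so its output entropy, and hence any mutual information passing through it, is capped at d·b bits.

```latex
% Capacity bound for a quantized head emitting d values at b bits each.
% This is a reconstruction of the framing, not the paper's derivation.
I(Z_{\text{state}}; X) \;\le\; H(Z_{\text{state}}) \;\le\; d \cdot b \;=\; C_{\text{state}} \ \text{bits}.
```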

What carries the argument

Mixed-precision information bottleneck implemented through an FP16 trait head and INT4 state head, together with Dynamic Precision Scheduling and Multi-Scale Temporal Fusion, that limits capacity to enforce separation.
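
A minimal sketch of how such a two-head design could be wired up, in PyTorch style. The encoder choice, head dimensions (64 and 32), and the fake INT4 quantizer with a straight-through gradient are all illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the MP-IB two-head idea: a shared encoder feeds an
# FP16 trait head and a fake-quantized INT4 state head. All names and
# dimensions are illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class FakeInt4(torch.autograd.Function):
    """Symmetric 4-bit fake quantization with a straight-through gradient."""
    @staticmethod
    def forward(ctx, x):
        scale = x.abs().max().clamp(min=1e-8) / 7.0   # map max magnitude into INT4 range [-8, 7]
        return torch.clamp(torch.round(x / scale), -8, 7) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out                                # straight-through estimator

class MPIBSketch(nn.Module):
    def __init__(self, feat_dim=80, trait_dim=64, state_dim=32):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, 128, batch_first=True)
        self.trait_head = nn.Linear(128, trait_dim)    # high-capacity path
        self.state_head = nn.Linear(128, state_dim)    # capacity-limited path

    def forward(self, x):                              # x: (batch, time, feat_dim)
        h, _ = self.encoder(x)
        h = h.mean(dim=1)                              # crude temporal pooling
        z_trait = self.trait_head(h).half().float()    # FP16 storage round-trip
        z_state = FakeInt4.apply(self.state_head(h))   # 4-bit bottleneck
        return z_trait, z_state
```

The precision round-trips here only mimic storage limits at the head outputs; the paper's Dynamic Precision Scheduling and Multi-Scale Temporal Fusion are omitted from the sketch.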

If this is right

  • The model runs at 23.4 ms latency with a 617 KB footprint, enabling real-time use on devices costing under 20 dollars.
  • Identity leakage drops to near-random levels with an EER of 0.42 and MIA-AUC of 0.52.
  • Zero-shot performance on CREMA-D reaches an AUC of 0.817.
  • Correlation gains of 2.8 to 15.9 points absolute are obtained over a 94M-parameter WavLM-Adapter, a beta VAE, and hand-crafted prosody features.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same capacity-asymmetry principle could be tested for disentangling stable versus transient signals in other sensor modalities such as accelerometer or ECG data.
  • If the precision schedule generalizes, training pipelines for disentanglement tasks could become simpler by removing the need for adversarial objectives.
  • Deployment on wearables might extend beyond bipolar monitoring to other longitudinal health states where identity must stay hidden.

Load-bearing premise

That the chosen bit widths, absent any adversarial losses, suffice by themselves to produce the observed trait-state separation through capacity limits.

What would settle it

A controlled ablation in which the FP16 and INT4 heads receive identical bit widths yet still match the reported correlation and leakage results would show that the precision asymmetry is not the operative mechanism; conversely, a collapse of the separation under matched bit widths would confirm that it is.

Figures

Figures reproduced from arXiv: 2605.03039 by Joydeep Chandra.

Figure 1. System architecture of MP-IB.
Figure 2. t-SNE visualization of MP-IB representations. The trait head (a) forms distinct clusters enabling speaker recognition, while the state head (b) exhibits high entropy for identity (H ≈ log2(120)), confirming successful disentanglement via the precision-based bottleneck.
Figure 3. State embeddings colored by agitation score. The gradient from low (blue) to high (red) confirms that while identity is suppressed, the clinical signal is preserved and properly distributed in latent space (n = 822).
Figure 4. High-fidelity wavelet analysis (top) and attentional saliency (bottom) on a bipolar disorder sample.
Original abstract

Continuous monitoring of bipolar disorder agitation via voice biomarkers requires disentangling stable speaker traits from volatile affective states on resource-constrained edge devices. We introduce MP-IB, the first framework to treat mixed-precision quantization as an information bottleneck for clinical trait-state separation. The core insight is that numerical precision itself controls capacity: an FP16 trait head (1,024 bits) encodes speaker identity, while an INT4 state head (128 bits) captures agitation, yielding 8x information asymmetry without adversarial training. We augment this with Dynamic Precision Scheduling and Multi-Scale Temporal Fusion. On Bridge2AI-Voice (N=833, 4 sessions/participant, strict speaker-independent CV), MP-IB achieves rho = 0.117 (95% CI: [0.089, 0.145], p=0.003 vs. chance), outperforming 94M-parameter WavLM-Adapter with in-domain SSL continuation (rho = -0.042), beta VAE disentanglement (rho = 0.089), and hand-crafted prosody (rho = 0.031) by 2.8–15.9 points absolute. Zero-shot transfer to CREMA-D achieves AUC=0.817. Identity leakage is suppressed to near-random (EER=0.42, MIA-AUC=0.52). End-to-end latency is 23.4 ms with a 617 KB footprint, enabling real-time monitoring on sub 20 dollar devices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MP-IB, a mixed-precision information bottleneck framework that uses quantization precision as a capacity constraint to disentangle speaker traits (via FP16 trait head, 1024 bits) from affective states (via INT4 state head, 128 bits) for on-device bipolar agitation detection from voice, without adversarial losses. Augmented by Dynamic Precision Scheduling and Multi-Scale Temporal Fusion, it reports rho=0.117 (95% CI [0.089,0.145]) on Bridge2AI-Voice (N=833, speaker-independent CV), outperforming WavLM-Adapter, beta-VAE, and prosody baselines, plus zero-shot AUC=0.817 on CREMA-D, low identity leakage (EER=0.42), and a 617KB/23.4ms footprint.

Significance. If the precision-asymmetry mechanism is shown to be causal, the work provides a lightweight alternative to adversarial disentanglement for clinical voice biomarkers, with direct implications for real-time edge deployment in mental health monitoring. The concrete metrics, CIs, and baseline comparisons are strengths, though the absence of controls leaves the core novelty unverified.

major comments (3)
  1. [Experimental results] Performance gains (rho=0.117 vs. baselines) are reported on held-out splits but without ablations on uniform-precision controls, bit-width sweeps, or removal of Dynamic Precision Scheduling/Multi-Scale Temporal Fusion. This is load-bearing for the central claim that the 8x asymmetry (FP16 vs. INT4) alone enforces trait-state separation via capacity limits.
  2. [Method] Method and results: no probing accuracies, mutual information estimates, or representation analyses are provided to confirm that the INT4 state head cannot encode speaker identity (as required by the information-bottleneck framing) while the FP16 trait head ignores agitation.
  3. [Abstract] Abstract and method: the stated capacities (FP16 trait head = 1024 bits, INT4 state head = 128 bits) are presented without explicit derivation from layer dimensions or parameter counts, making it impossible to verify that quantization strictly caps mutual information as claimed.
minor comments (2)
  1. The manuscript would benefit from an appendix with full training hyperparameters, optimizer settings, and exact model architectures to support reproducibility of the reported EER and AUC values.
  2. [Results] Figure captions and tables should explicitly state the number of runs or seeds used for the 95% CIs to clarify statistical robustness.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments identify key areas where additional evidence would strengthen the central claims about the mixed-precision information bottleneck. We agree that the manuscript would benefit from further controls and analyses, and we will incorporate revisions accordingly. Below we respond point-by-point to the major comments.

Point-by-point responses
  1. Referee: [Experimental results] Performance gains (rho=0.117 vs. baselines) are reported on held-out splits but without ablations on uniform-precision controls, bit-width sweeps, or removal of Dynamic Precision Scheduling/Multi-Scale Temporal Fusion. This is load-bearing for the central claim that the 8x asymmetry (FP16 vs. INT4) alone enforces trait-state separation via capacity limits.

    Authors: We acknowledge that isolating the contribution of the precision asymmetry requires explicit controls. The current experiments demonstrate consistent gains over strong baselines under speaker-independent evaluation, but we agree that uniform-precision ablations, bit-width sweeps, and component removals are necessary to substantiate the causal role of the 8x capacity constraint. In the revised manuscript we will add these ablations to the Experimental Results section, including (i) all-FP16 and all-INT4 variants, (ii) sweeps over state-head bit widths from 2 to 8 bits, and (iii) versions without Dynamic Precision Scheduling and without Multi-Scale Temporal Fusion. These additions will directly test whether the reported performance depends on the mixed-precision bottleneck; a sketch of such an ablation grid follows these responses. revision: yes

  2. Referee: [Method] Method and results: no probing accuracies, mutual information estimates, or representation analyses are provided to confirm that the INT4 state head cannot encode speaker identity (as required by the information-bottleneck framing) while the FP16 trait head ignores agitation.

    Authors: We agree that direct empirical verification of the information capacities would reinforce the information-bottleneck interpretation. Although the performance metrics and low identity leakage (EER=0.42) are consistent with the intended separation, we will add the requested analyses in the revision. Specifically, we will include linear probing accuracies for speaker identity on both heads, mutual-information estimates between representations and speaker labels (where computationally feasible), and qualitative representation visualizations (e.g., t-SNE) showing trait versus state clustering. These will be presented in a new subsection of the Results to confirm that the INT4 head is capacity-limited with respect to identity while the FP16 head retains trait information; a sketch of such a probe follows these responses. revision: yes

  3. Referee: [Abstract] Abstract and method: the stated capacities (FP16 trait head = 1024 bits, INT4 state head = 128 bits) are presented without explicit derivation from layer dimensions or parameter counts, making it impossible to verify that quantization strictly caps mutual information as claimed.

    Authors: We apologize for the omission of the explicit derivation. The stated bit capacities follow from the output dimensionality of each head multiplied by the effective bits per value after quantization (FP16 at 16 bits, INT4 at 4 bits), adjusted for the architectural dimensions of the respective heads. In the revised Method section we will provide the full calculation, including the precise layer dimensions, the formula used to obtain 1024 bits for the trait head and 128 bits for the state head, and a brief discussion of how quantization imposes an upper bound on mutual information under the information-bottleneck framework. This will make the 8x asymmetry verifiable from the architecture description; the arithmetic is sketched after these responses. revision: yes
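
On point 1, a minimal sketch of how the promised ablation grid might be organized. The keys, flags, and tags below are assumptions about how such a sweep could be expressed, not the authors' experiment code.

```python
# Illustrative ablation grid for the promised controls (point 1).
# Configuration schema is assumed; each config would be trained and
# scored on the same speaker-independent CV splits as the main model.
ablations = (
    [{"trait_bits": 16, "state_bits": 16, "tag": "all-FP16 control"},
     {"trait_bits": 4,  "state_bits": 4,  "tag": "all-INT4 control"}]
    + [{"trait_bits": 16, "state_bits": b, "tag": f"state-head sweep: {b}-bit"}
       for b in (2, 3, 4, 5, 6, 8)]
    + [{"trait_bits": 16, "state_bits": 4, "dps": False,
        "tag": "without Dynamic Precision Scheduling"},
       {"trait_bits": 16, "state_bits": 4, "mstf": False,
        "tag": "without Multi-Scale Temporal Fusion"}]
)

for cfg in ablations:
    print(cfg["tag"])
```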
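
On point 2, one way the promised identity probe could look. The embeddings and labels below are random stand-ins for quantities the revision would extract from the trained heads; with a real model, the probe would be run once per head and compared against chance.

```python
# Linear speaker-identity probe on state-head embeddings (point 2).
# z_state and speakers are random stand-ins: n = 822 as in Figure 3,
# 120 speakers as in Figure 2. Accuracy near 1/120 means no leakage.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
z_state = rng.normal(size=(822, 32))        # stand-in state embeddings
speakers = rng.integers(0, 120, size=822)   # stand-in speaker labels

z_tr, z_te, y_tr, y_te = train_test_split(z_state, speakers,
                                          test_size=0.3, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(z_tr, y_tr)
print(f"identity probe accuracy: {probe.score(z_te, y_te):.3f}")  # ~1/120 if nothing leaks
```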
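
On point 3, the arithmetic is easy to reconstruct up to a choice of head dimensions. The dimensions below (64 and 32) are back-solved assumptions that reproduce the quoted totals; other (dimension, bit-width) pairs do too, which is exactly why the referee asks for the explicit derivation.

```latex
% Back-solved head dimensions (assumptions, not stated values):
% d_trait = 64, d_state = 32.
\begin{align*}
C_{\text{trait}} &= d_{\text{trait}} \cdot b_{\text{FP16}} = 64 \times 16 = 1{,}024 \ \text{bits},\\
C_{\text{state}} &= d_{\text{state}} \cdot b_{\text{INT4}} = 32 \times 4 = 128 \ \text{bits},\\
C_{\text{trait}} / C_{\text{state}} &= 1{,}024 / 128 = 8.
\end{align*}
```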

Circularity Check

0 steps flagged

No significant circularity; central claims rest on empirical evaluation on held-out data rather than self-referential definitions or fits.

Full rationale

The paper presents MP-IB as using mixed-precision quantization (FP16 trait head at 1024 bits vs. INT4 state head at 128 bits) to create an 8x information asymmetry for trait-state disentanglement without adversarial losses, augmented by Dynamic Precision Scheduling and Multi-Scale Temporal Fusion. This is validated via rho=0.117 (p=0.003) on speaker-independent CV splits of Bridge2AI-Voice (N=833) and zero-shot AUC=0.817 on CREMA-D, with identity leakage near chance (EER=0.42). No load-bearing step reduces the reported gains or the capacity-control insight to a fitted parameter renamed as prediction, a self-citation chain, or an equation that is true by construction. The performance numbers and comparisons to WavLM-Adapter, beta VAE, and prosody baselines are external and falsifiable on the stated splits; the assumption that quantization strictly caps mutual information is stated as an insight, not derived from the paper's own outputs.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that bit-width directly limits representational capacity in a controllable way for disentanglement, plus two hand-chosen bit widths treated as design parameters.

free parameters (2)
  • Trait head precision = FP16 (1024 bits)
    Chosen to provide high capacity for speaker identity
  • State head precision = INT4 (128 bits)
    Chosen to enforce low capacity on the head carrying agitation features
axioms (1)
  • domain assumption: Numerical precision directly controls the information capacity of separate model heads
    Core insight invoked to justify the 8x asymmetry without additional losses

pith-pipeline@v0.9.0 · 5576 in / 1394 out tokens · 42357 ms · 2026-05-08T19:06:02.891361+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
