Recognition: 2 Lean theorem links
EviDep: Trustworthy Multimodal Depression Estimation via Disentangled Evidential Learning
Pith reviewed 2026-05-11 01:41 UTC · model grok-4.3
The pith
EviDep estimates depression severity from video and audio while also reporting aleatoric and epistemic uncertainty to reduce overconfident predictions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EviDep jointly quantifies depression severity alongside aleatoric and epistemic uncertainties via a Normal-Inverse-Gamma distribution. A Frequency-aware Feature Extraction module with wavelet-based Mixture-of-Experts decouples macro-level affective baselines from micro-level behavioral bursts to filter task-irrelevant artifacts. A Disentangled Evidential Learning strategy then decorrelates cross-modal shared consensus from modality-specific nuances before Bayesian fusion, strictly preventing double-counting of overlapping information.
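To make the distribution concrete, here is a minimal sketch of a Normal-Inverse-Gamma output head in the standard deep-evidential-regression parameterization (gamma, nu, alpha, beta), from which the severity estimate and both uncertainties fall out in closed form. Layer names and activation choices are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NIGHead(nn.Module):
    """Normal-Inverse-Gamma head: one linear layer emits (gamma, nu, alpha, beta).
    Illustrative sketch of evidential regression, not the paper's architecture."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 4)

    def forward(self, h: torch.Tensor):
        gamma, nu_raw, alpha_raw, beta_raw = self.proj(h).chunk(4, dim=-1)
        nu = F.softplus(nu_raw)                  # nu > 0
        alpha = F.softplus(alpha_raw) + 1.0      # alpha > 1 keeps the moments finite
        beta = F.softplus(beta_raw)              # beta > 0
        pred = gamma                             # severity estimate: E[mu] = gamma
        aleatoric = beta / (alpha - 1.0)         # data noise: E[sigma^2]
        epistemic = beta / (nu * (alpha - 1.0))  # model doubt: Var[mu]
        return pred, aleatoric, epistemic
```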
What carries the argument
The Disentangled Evidential Learning strategy, which separates shared consensus features from modality-specific nuances before evidential fusion, paired with the wavelet-based Mixture-of-Experts in the Frequency-aware Feature Extraction module.
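The abstract names explicit decorrelation but shows no loss; one common realization is a cross-covariance penalty between the shared-consensus and modality-specific embeddings, sketched here under that assumption.

```python
import torch

def decorrelation_loss(shared: torch.Tensor, specific: torch.Tensor) -> torch.Tensor:
    """Cross-covariance penalty between shared and modality-specific features,
    both of shape (batch, dim). Driving every cross-covariance entry toward
    zero is one standard way to enforce explicit decorrelation; the paper's
    exact objective may differ."""
    s = shared - shared.mean(dim=0, keepdim=True)
    m = specific - specific.mean(dim=0, keepdim=True)
    cov = s.t() @ m / max(shared.shape[0] - 1, 1)  # (dim_shared, dim_specific)
    return cov.pow(2).mean()
```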
If this is right
- Provides risk-aware outputs that let downstream users weight high-uncertainty cases more cautiously.
- Reduces double-counting of redundant multimodal information, yielding better-calibrated confidence.
- Maintains state-of-the-art accuracy on AVEC 2013, AVEC 2014, DAIC-WOZ, and E-DAIC while adding uncertainty reporting.
- Handles temporal-frequency heterogeneity in behavioral cues without manual feature engineering.
Where Pith is reading between the lines
- The same separation of shared and specific signals could be tested on other multimodal health tasks where overlapping cues across sensors risk overcounting.
- If uncertainty tracks real clinical variability, the model could serve as a filter that flags cases needing human review before any automated recommendation (see the sketch after this list).
- Extending the wavelet Mixture-of-Experts to additional modalities such as text transcripts would test whether the frequency-decoupling benefit generalizes.
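The human-review idea in the second bullet has a standard selective-prediction form: abstain on the most uncertain fraction of cases and check whether error on the retained cases drops. A minimal sketch assuming per-case total uncertainty is available; the referral fraction is an arbitrary choice.

```python
import numpy as np

def mae_after_referral(y_true, y_pred, total_unc, refer_frac=0.2):
    """Refer the most uncertain `refer_frac` of cases to human review and
    report MAE on the automatically handled remainder. If uncertainty is
    informative, this should beat the full-coverage MAE."""
    y_true, y_pred, total_unc = map(np.asarray, (y_true, y_pred, total_unc))
    order = np.argsort(total_unc)                        # most confident first
    keep = order[: int(round(len(order) * (1 - refer_frac)))]
    return float(np.mean(np.abs(y_true[keep] - y_pred[keep])))
```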
Load-bearing premise
That the wavelet Mixture-of-Experts successfully removes artifacts without discarding depression-relevant signals and that explicit decorrelation of shared and specific features prevents confidence inflation without losing useful information.
What would settle it
Finding that uncertainty estimates do not rise for incorrect predictions on a new test split of the E-DAIC dataset, or that removing the disentanglement step leaves accuracy and calibration unchanged.
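The first test can be phrased as a rank check on a held-out split: if the uncertainties are risk-aware, they should rise with absolute error. A minimal sketch assuming access to predictions and total uncertainty; none of this code is from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

def uncertainty_tracks_error(y_true, y_pred, total_unc):
    """Rank correlation between absolute error and reported uncertainty.
    A rho near zero or negative is the failure mode described above; a
    clearly positive rho is consistent with risk-aware behavior."""
    err = np.abs(np.asarray(y_true) - np.asarray(y_pred))
    rho, pval = spearmanr(np.asarray(total_unc), err)
    return rho, pval
```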
Original abstract
Automated multimodal depression estimation in unconstrained environments is inherently challenged by naturalistic noise and complex behavioral variability. Prevailing deterministic methods, however, produce uncalibrated point estimates without quantifying predictive uncertainty, exposing decision-making to the risk of overconfident, untrustworthy estimates. To establish a reliable and trustworthy estimation paradigm, we propose EviDep, an evidential learning framework that jointly quantifies depression severity alongside aleatoric and epistemic uncertainties via a Normal-Inverse-Gamma distribution. To ensure the integrity of the extracted behavioral evidence and prevent artificial confidence inflation during multimodal fusion, EviDep introduces two tailored mechanisms. First, addressing the temporal-frequency heterogeneity of behavioral cues, a Frequency-aware Feature Extraction module leverages a wavelet-based Mixture-of-Experts to dynamically decouple stable macro-level affective baselines from transient micro-level behavioral bursts, effectively filtering out task-irrelevant artifacts. Second, a Disentangled Evidential Learning strategy enforces explicit decorrelation of features in these purified representations. By separating the cross-modal shared consensus from modality-specific behavioral nuances before Bayesian fusion, this rigorous disentanglement strictly prevents the model from double-counting overlapping information. Extensive experiments on the AVEC 2013, AVEC 2014, DAIC-WOZ, and E-DAIC datasets confirm that EviDep achieves state-of-the-art predictive accuracy and superior uncertainty calibration, thereby delivering a trustworthy, risk-aware decision-support tool for depression estimation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes EviDep, a multimodal evidential learning framework for depression severity estimation from behavioral cues. It models predictions via a Normal-Inverse-Gamma distribution to jointly output severity scores along with aleatoric and epistemic uncertainties. Two core mechanisms are introduced: a Frequency-aware Feature Extraction module that employs a wavelet-based Mixture-of-Experts to separate macro-level affective baselines from micro-level bursts, and a Disentangled Evidential Learning strategy that explicitly decorrelates shared cross-modal consensus from modality-specific features prior to Bayesian fusion. Experiments across the AVEC 2013, AVEC 2014, DAIC-WOZ, and E-DAIC datasets are reported to achieve state-of-the-art accuracy while delivering superior uncertainty calibration.
Significance. If the empirical claims hold, the work offers a substantive contribution to trustworthy multimodal learning for mental health applications by moving beyond uncalibrated point estimates to risk-aware predictions. The integration of evidential deep learning with domain-specific disentanglement for temporal-frequency heterogeneity addresses a genuine practical need. The multi-dataset evaluation provides a reasonable empirical foundation, and the explicit focus on preventing overcounting in fusion is a clear methodological strength.
Major comments (3)
- [§3.2] §3.2 (Frequency-aware Feature Extraction): The assertion that the wavelet-based Mixture-of-Experts successfully isolates task-irrelevant artifacts while preserving depression-relevant variance lacks supporting evidence such as expert activation maps, frequency-domain ablations, or quantitative comparison of retained signal variance before/after the module. Without these, the downstream claim of trustworthy uncertainty quantification cannot be fully evaluated.
- [§3.3] §3.3 and §4.3 (Disentangled Evidential Learning and ablations): The decorrelation loss is presented as strictly preventing double-counting of overlapping information, yet no ablation isolating its effect on both predictive accuracy (e.g., RMSE/MAE deltas) and calibration metrics (e.g., ECE or NLL) is reported. If the loss is too aggressive it could attenuate modality-specific severity cues, directly undermining the central trustworthiness argument.
- [Table 2] Table 2 / §4.2 (main results): The SOTA accuracy and calibration claims rest on the two unverified functional assumptions identified above. Direct comparisons against strong multimodal baselines with uncertainty heads (e.g., MC-dropout or deep ensembles) and explicit reporting of uncertainty calibration curves or sharpness metrics would be required to substantiate superiority (a sketch of such calibration metrics follows this list).
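The calibration metrics requested above have standard forms; the sketch below gives one common recipe for regression ECE (central-interval coverage) and Gaussian NLL. These are generic definitions, not necessarily the metrics the paper would report.

```python
import numpy as np
from scipy.stats import norm

def regression_ece(y_true, mu, sigma, n_levels=9):
    """Interval-coverage calibration error: for each nominal level p, compare
    empirical coverage of the central p-interval of N(mu, sigma^2) to p."""
    y_true, mu, sigma = map(np.asarray, (y_true, mu, sigma))
    gaps = []
    for p in np.linspace(0.1, 0.9, n_levels):
        z = norm.ppf(0.5 + p / 2.0)                  # half-width in std units
        coverage = np.mean(np.abs(y_true - mu) <= z * sigma)
        gaps.append(abs(coverage - p))
    return float(np.mean(gaps))

def gaussian_nll(y_true, mu, sigma):
    """Mean negative log-likelihood under the predictive Gaussian."""
    y_true, mu, sigma = map(np.asarray, (y_true, mu, sigma))
    return float(np.mean(0.5 * np.log(2 * np.pi * sigma ** 2)
                         + (y_true - mu) ** 2 / (2 * sigma ** 2)))
```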
Minor comments (3)
- [§3.1] Notation for the Normal-Inverse-Gamma parameters (e.g., the four output heads) should be introduced with a single consolidated equation rather than scattered across subsections to improve readability.
- The manuscript would benefit from a dedicated limitations paragraph discussing potential failure modes when behavioral cues are sparse or when one modality is missing.
- [Figure 3] Figure 3 (architecture diagram) would be clearer with explicit arrows indicating the flow of the decorrelation loss and the final evidential fusion step.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We have carefully reviewed each major comment and provide point-by-point responses below. We agree that additional empirical evidence will strengthen the manuscript and will incorporate the requested analyses in the revised version.
Point-by-point responses
- Referee: [§3.2] §3.2 (Frequency-aware Feature Extraction): The assertion that the wavelet-based Mixture-of-Experts successfully isolates task-irrelevant artifacts while preserving depression-relevant variance lacks supporting evidence such as expert activation maps, frequency-domain ablations, or quantitative comparison of retained signal variance before/after the module. Without these, the downstream claim of trustworthy uncertainty quantification cannot be fully evaluated.
  Authors: We acknowledge that the current description of the Frequency-aware Feature Extraction module would benefit from direct empirical validation of its separation capabilities. In the revised manuscript, we will include expert activation maps, frequency-domain ablation results, and quantitative comparisons of retained signal variance (pre- and post-module) to demonstrate that depression-relevant features are preserved while task-irrelevant artifacts are attenuated. revision: yes
- Referee: [§3.3] §3.3 and §4.3 (Disentangled Evidential Learning and ablations): The decorrelation loss is presented as strictly preventing double-counting of overlapping information, yet no ablation isolating its effect on both predictive accuracy (e.g., RMSE/MAE deltas) and calibration metrics (e.g., ECE or NLL) is reported. If the loss is too aggressive it could attenuate modality-specific severity cues, directly undermining the central trustworthiness argument.
  Authors: We agree that an isolated ablation of the decorrelation loss is necessary to fully substantiate its contribution. Although §4.3 contains related ablations, they do not isolate the loss's impact on both accuracy and calibration. In the revision, we will add a dedicated ablation table reporting RMSE, MAE, ECE, and NLL deltas with and without the decorrelation loss, confirming that it improves calibration without unduly suppressing modality-specific cues. revision: yes
- Referee: [Table 2] Table 2 / §4.2 (main results): The SOTA accuracy and calibration claims rest on the two unverified functional assumptions identified above. Direct comparisons against strong multimodal baselines with uncertainty heads (e.g., MC-dropout or deep ensembles) and explicit reporting of uncertainty calibration curves or sharpness metrics would be required to substantiate superiority.
  Authors: We appreciate the suggestion to benchmark against additional uncertainty-aware multimodal methods. To strengthen the empirical claims, the revised manuscript will include direct comparisons to strong baselines such as MC-dropout and deep ensembles applied to the same multimodal inputs. We will also report uncertainty calibration curves and sharpness metrics alongside the existing results to provide a more complete evaluation of calibration quality. revision: yes
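The MC-dropout baseline promised above has a standard recipe: keep dropout stochastic at test time and read epistemic uncertainty off the spread of repeated forward passes. A generic sketch, not the authors' code; only dropout modules are switched to train mode so batch-norm statistics stay frozen.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 30):
    """Run n_samples stochastic forward passes with dropout enabled and
    return the predictive mean and the across-sample variance (a proxy
    for epistemic uncertainty)."""
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()                 # re-enable dropout only
    preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.var(dim=0)
```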
Circularity Check
No circularity detected; derivation chain not inspectable via equations
Full rationale
The visible abstract and framework description introduce modules (wavelet-based MoE for frequency-aware extraction and explicit decorrelation in disentangled evidential learning) but present no equations, parameter-fitting steps, or derivation chains that reduce outputs to inputs by construction. Claims of SOTA accuracy and calibration rest on experiments across external datasets (AVEC 2013/2014, DAIC-WOZ, E-DAIC), which constitute independent benchmarks rather than self-referential fits. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing manner within the provided text. This is the common case of a self-contained empirical proposal without mathematical circularity.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Behavioral signals exhibit temporal-frequency heterogeneity that can be decoupled by a wavelet-based Mixture-of-Experts into stable baselines and transient bursts (see the sketch after this list).
- Domain assumption: Multimodal representations contain separable cross-modal consensus and modality-specific nuances that can be explicitly decorrelated before fusion.
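The first axiom is easy to exercise directly: a discrete wavelet transform already splits a behavioral time series into a slow baseline and fast bursts. A sketch using PyWavelets; the wavelet family and decomposition level are arbitrary choices, and the paper's expert routing is not reproduced.

```python
import numpy as np
import pywt

def decouple_scales(signal, wavelet="db4", level=4):
    """Reconstruct the macro-level baseline from the approximation
    coefficients alone; the residual carries the micro-level bursts."""
    x = np.asarray(signal, dtype=float)
    coeffs = pywt.wavedec(x, wavelet, level=level)
    approx, details = coeffs[0], coeffs[1:]
    baseline = pywt.waverec([approx] + [np.zeros_like(d) for d in details], wavelet)
    n = min(len(x), len(baseline))               # waverec can pad by one sample
    return baseline[:n], x[:n] - baseline[:n]    # (baseline, bursts)
```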
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tagged unclear
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Passage: "Frequency-aware Feature Extraction module leverages a wavelet-based Mixture-of-Experts to dynamically decouple stable macro-level affective baselines from transient micro-level behavioral bursts"
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · tagged unclear
  Unclear: relation between the paper passage and the cited Recognition theorem.
  Passage: "Disentangled Evidential Learning strategy enforces explicit decorrelation of features... orthogonality and consistency constraints"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.