Rethinking Continual Learning for Speech and Audio: A Representation-Centric Taxonomy and Open Problems

Eun-Jung Holden; Siyi Wang; Ting Dang; Yang Xiao

Speech and audio systems operate in inherently non-stationary environments, yet continual learning (CL) research in this domain, especially in the foundation model era, remains fragmented and fails to account for the coupled, geometry-sensitive nature of acoustic representations. Modern speech foundation models operate over highly entangled, continuous representations that jointly encode linguistic, speaker, and paralinguistic factors within a shared latent space. CL is therefore fundamentally about preserving and evolving shared representation structure rather than retaining isolated task knowledge. In this work, we revisit CL for speech from a representation-centric perspective and introduce a new taxonomy that organizes CL according to how the underlying representation geometry evolves under non-stationary acoustic conditions. We further identify key mismatches between current CL assumptions and speech foundation model behavior, and outline a set of open challenges and future research directions.
