HuBERT: Self-supervised speech representation learning by masked prediction of hidden units
9 Pith papers cite this work.
verdicts
UNVERDICTED 9
citing papers explorer
- Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs
  Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.
- Multi-Axis Speech Similarity via Factor-Partitioned Embeddings
  A factor-partitioned embedding framework maps speech utterances to vectors with subspaces for distinct attributes, supporting signed weighted similarity retrieval that can suppress or emphasize specific factors like speaker identity.
- Beyond Decodability: Reconstructing Language Model Representations with an Encoding Probe
  An encoding probe reconstructs transformer representations from acoustic, phonetic, syntactic, lexical and speaker features, showing independent syntactic/lexical contributions and training-dependent speaker effects.
- EmbodiedHead: Real-Time Listening and Speaking Avatar for Conversational Agents
  EmbodiedHead introduces a Rectified-Flow Diffusion Transformer with differentiable renderer and single-stream listening-speaking conditioning to achieve real-time high-fidelity conversational avatars.
- SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation
  SpidR-Adapt uses meta-learning with a first-order bi-level optimization heuristic to adapt speech representations to new languages with less than 1 hour of data, achieving 100x better efficiency than standard training.
- Sessa: Selective State Space Attention
  Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.
- A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification
  Lasso-selected speech tokens enhance text LLMs for multimodal classification by reducing long audio sequences to task-relevant features via self-supervised adaptation.
- Phonological Subspace Collapse Is Aetiology-Specific and Cross-Lingually Stable: Evidence from 3,374 Speakers
  Phonological subspace collapse in SSL speech representations produces aetiology-specific degradation profiles that remain stable in shape across languages and model architectures.
- Investigating the Representation of Backchannels and Fillers in Fine-tuned Language Models
  Fine-tuning on annotated English and Japanese dialogues improves clustering of backchannels and fillers and makes generated utterances closer to human ones.