HuBERT: Self-supervised speech representation learning by masked prediction of hidden units

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, Abdelrahman Mohamed · 2021 · arXiv 2106.07447

9 Pith papers cite this work. Polarity classification is still in progress.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 1 · use method: 1

representative citing papers

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs

cs.CL · 2025-12-18 · unverdicted · novelty 7.0

Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.

Multi-Axis Speech Similarity via Factor-Partitioned Embeddings

eess.AS · 2026-05-04 · unverdicted · novelty 6.0

A factor-partitioned embedding framework maps speech utterances to vectors with subspaces for distinct attributes, supporting signed weighted similarity retrieval that can suppress or emphasize specific factors like speaker identity.

Beyond Decodability: Reconstructing Language Model Representations with an Encoding Probe

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

An encoding probe reconstructs transformer representations from acoustic, phonetic, syntactic, lexical and speaker features, showing independent syntactic/lexical contributions and training-dependent speaker effects.

EmbodiedHead: Real-Time Listening and Speaking Avatar for Conversational Agents

cs.CV · 2026-04-19 · unverdicted · novelty 6.0

EmbodiedHead introduces a Rectified-Flow Diffusion Transformer with differentiable renderer and single-stream listening-speaking conditioning to achieve real-time high-fidelity conversational avatars.

SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation

cs.CL · 2025-12-24 · unverdicted · novelty 6.0

SpidR-Adapt uses meta-learning with a first-order bi-level optimization heuristic to adapt speech representations to new languages with less than 1 hour of data, achieving 100x better efficiency than standard training.

Sessa: Selective State Space Attention

cs.LG · 2026-04-20 · unverdicted · novelty 5.0

Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.

A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification

cs.CL · 2025-12-08 · unverdicted · novelty 5.0

Lasso-selected speech tokens enhance text LLMs for multimodal classification by reducing long audio sequences to task-relevant features via self-supervised adaptation.

Phonological Subspace Collapse Is Aetiology-Specific and Cross-Lingually Stable: Evidence from 3,374 Speakers

cs.CL · 2026-04-23 · unverdicted · novelty 4.0

Phonological subspace collapse in SSL speech representations produces aetiology-specific degradation profiles that remain stable in shape across languages and model architectures.

Investigating the Representation of Backchannels and Fillers in Fine-tuned Language Models

cs.CL · 2025-09-24 · unverdicted · novelty 4.0

Fine-tuning on annotated English and Japanese dialogues improves clustering of backchannels and fillers and makes generated utterances closer to human ones.

citing papers explorer

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs · cs.CL · 2025-12-18 · unverdicted · none · ref 40
Multi-Axis Speech Similarity via Factor-Partitioned Embeddings · eess.AS · 2026-05-04 · unverdicted · none · ref 15
Beyond Decodability: Reconstructing Language Model Representations with an Encoding Probe · cs.CL · 2026-05-01 · unverdicted · none · ref 25
EmbodiedHead: Real-Time Listening and Speaking Avatar for Conversational Agents · cs.CV · 2026-04-19 · unverdicted · none · ref 15
SpidR-Adapt: A Universal Speech Representation Model for Few-Shot Adaptation · cs.CL · 2025-12-24 · unverdicted · none · ref 21
Sessa: Selective State Space Attention · cs.LG · 2026-04-20 · unverdicted · none · ref 22
A Simple Method to Enhance Pre-trained Language Models with Speech Tokens for Classification · cs.CL · 2025-12-08 · unverdicted · none · ref 16
Phonological Subspace Collapse Is Aetiology-Specific and Cross-Lingually Stable: Evidence from 3,374 Speakers · cs.CL · 2026-04-23 · unverdicted · none · ref 16
Investigating the Representation of Backchannels and Fillers in Fine-tuned Language Models · cs.CL · 2025-09-24 · unverdicted · none · ref 28
