wav2vec 2.0: A framework for self-supervised learning of speech representations

PMLR · 2019 · arXiv 2006.11477

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

read on arXiv browse 9 citing papers

citation-role summary

other 1

citation-polarity summary

unclear 1

representative citing papers

Moshi: a speech-text foundation model for real-time dialogue

eess.AS · 2024-09-17 · accept · novelty 7.0

Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

MMTalker: Multiresolution 3D Talking Head Synthesis with Multimodal Feature Fusion

cs.CV · 2026-04-03 · unverdicted · novelty 6.0

MMTalker combines multi-resolution mesh sampling with residual graph convolutions and dual cross-attention to synthesize accurate 3D talking head motions from audio.

Voxtral Realtime

cs.AI · 2026-02-11 · unverdicted · novelty 6.0

Voxtral Realtime is an end-to-end trained streaming ASR model that achieves Whisper-level transcription quality at 480ms delay after scaling pretraining across 13 languages.

Step-Audio 2 Technical Report

cs.CL · 2025-07-22 · unverdicted · novelty 6.0

Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.

Atlas: Few-shot Learning with Retrieval Augmented Language Models

cs.CL · 2022-08-05 · unverdicted · novelty 6.0

Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.

Unsupervised Dense Information Retrieval with Contrastive Learning

cs.IR · 2021-12-16 · unverdicted · novelty 6.0

Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.

Evaluating Large Language Models Trained on Code

cs.LG · 2021-07-07 · accept · novelty 6.0

Codex achieves 28.8% pass@1 on HumanEval, rising to 70.2% with 100 samples per problem via repeated sampling.

Enhancing ASR Performance in the Medical Domain for Dravidian Languages

eess.AS · 2026-04-10 · unverdicted · novelty 5.0

A hybrid confidence-aware ASR training framework with learnable weights reduces Telugu medical WER from 24.3% to 15.8% and Kannada from 31.7% to 25.4%, outperforming standard fine-tuning.

Domain-Adapted Fine-Tuning of ECG Foundation Models for Multi-Label Structural Heart Disease Screening

cs.LG · 2026-04-25 · unverdicted · novelty 4.0

Domain-adapted ECG foundation models with self-supervised pretraining and selective fine-tuning reach macro-AUROC 0.8509 for multi-label structural heart disease detection on the EchoNext benchmark.

citing papers explorer

Showing 9 of 9 citing papers.

Moshi: a speech-text foundation model for real-time dialogue eess.AS · 2024-09-17 · accept · none · ref 3
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
MMTalker: Multiresolution 3D Talking Head Synthesis with Multimodal Feature Fusion cs.CV · 2026-04-03 · unverdicted · none · ref 38
MMTalker combines multi-resolution mesh sampling with residual graph convolutions and dual cross-attention to synthesize accurate 3D talking head motions from audio.
Voxtral Realtime cs.AI · 2026-02-11 · unverdicted · none · ref 2
Voxtral Realtime is an end-to-end trained streaming ASR model that achieves Whisper-level transcription quality at 480ms delay after scaling pretraining across 13 languages.
Step-Audio 2 Technical Report cs.CL · 2025-07-22 · unverdicted · none · ref 3
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.
Atlas: Few-shot Learning with Retrieval Augmented Language Models cs.CL · 2022-08-05 · unverdicted · none · ref 119
Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.
Unsupervised Dense Information Retrieval with Contrastive Learning cs.IR · 2021-12-16 · unverdicted · none · ref 91
Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.
Evaluating Large Language Models Trained on Code cs.LG · 2021-07-07 · accept · none · ref 1
Codex achieves 28.8% pass@1 on HumanEval, rising to 70.2% with 100 samples per problem via repeated sampling.
Enhancing ASR Performance in the Medical Domain for Dravidian Languages eess.AS · 2026-04-10 · unverdicted · none · ref 16
A hybrid confidence-aware ASR training framework with learnable weights reduces Telugu medical WER from 24.3% to 15.8% and Kannada from 31.7% to 25.4%, outperforming standard fine-tuning.
Domain-Adapted Fine-Tuning of ECG Foundation Models for Multi-Label Structural Heart Disease Screening cs.LG · 2026-04-25 · unverdicted · none · ref 20
Domain-adapted ECG foundation models with self-supervised pretraining and selective fine-tuning reach macro-AUROC 0.8509 for multi-label structural heart disease detection on the EchoNext benchmark.

wav2vec 2.0: A framework for self-supervised learning of speech representations

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer