wav2vec 2.0: A framework for self-supervised learning of speech representations

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

other: 1

citation-polarity summary

unclear: 1

representative citing papers

Moshi: a speech-text foundation model for real-time dialogue

eess.AS · 2024-09-17 · accept · novelty 7.0

Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

Voxtral Realtime

cs.AI · 2026-02-11 · unverdicted · novelty 6.0

Voxtral Realtime is an end-to-end trained streaming ASR model that achieves Whisper-level transcription quality at 480 ms delay after scaling pretraining across 13 languages.

Step-Audio 2 Technical Report

cs.CL · 2025-07-22 · unverdicted · novelty 6.0

Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.
