Lip Forcing distills a 14B bidirectional video diffusion teacher into autoregressive students that achieve real-time lip synchronization at 31 FPS using two denoising steps without CFG.
hub
URL http://proceedings.mlr.press/ v37/allamanis15.html
19 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
other 1polarities
unclear 1representative citing papers
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
Voxtral Realtime is an end-to-end trained streaming ASR model that achieves Whisper-level transcription quality at 480ms delay after scaling pretraining across 13 languages.
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.
Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.
Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.
Codex achieves 28.8% pass@1 on HumanEval, rising to 70.2% with 100 samples per problem via repeated sampling.
wav2VOT shows wav2vec2 can estimate voice onset time and related stop consonant features with accuracy comparable to existing tools on unseen data and higher accuracy after fine-tuning.
CoughPhase-CLR uses cough physiological phases to build contrastive positive pairs, outperforming random cropping on downstream tasks including COVID-19 detection and COPD classification.
F3-Tokenizer adapts audio autoencoder latents with noise-regularized bottleneck (channel normalization and stochastic perturbation) and a representation encoder (RQ-MTP plus frozen-LLM supervision) to support both high-dimensional understanding representations and normalized continuous generation ta
CAFNet performs joint ternary classification and temporal boundary regression for half-truth audio deepfakes via cross-attentive fusion of MFCC, LFCC, and Chroma-STFT features, reporting 92.71% accuracy and 0.075s MAE on MLADDC T2+T3.
A hybrid confidence-aware ASR training framework with learnable weights reduces Telugu medical WER from 24.3% to 15.8% and Kannada from 31.7% to 25.4%, outperforming standard fine-tuning.
Machine interpreting should shift from fidelity metrics to three design priorities—agency, grounding, and experience—drawn from interpreting studies to close the usability gap with human-mediated communication.
Fine-tuned Wav2Vec2 and HuBERT models recognize click consonants more accurately than non-clicks in G|ui and West !Xoon data.
Domain-adapted ECG foundation models with self-supervised pretraining and selective fine-tuning reach macro-AUROC 0.8509 for multi-label structural heart disease detection on the EchoNext benchmark.
SpAArSIST sparsifies AASIST by swapping learned pooling for explicit magnitude-based scoring and mean aggregation, cutting compute 20.7% and improving In-the-Wild EER to 2.82%.
A survey of spatial speech perception systems covering sound source localization, directional enhancement, and automatic speech recognition methods and their integration.
Empirical comparison of LSTM, GNN, and Transformer architectures for NBA trajectory forecasting finds hybrid LSTM with contextual information yields lowest FDE of 1.51m over horizons up to 2s.
citing papers explorer
No citing papers match the current filters.