hub

Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition

· 2022 · arXiv 2206.08317

11 Pith papers cite this work. Polarity classification is still indexing.

11 Pith papers citing it

read on arXiv browse 11 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

method 2

citation-polarity summary

use method 2

representative citing papers

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs

cs.CL · 2026-03-20 · unverdicted · novelty 7.0

OmniTrace converts token-level signals into span-level cross-modal attributions for open-ended generation in omni-modal LLMs via generation-time tracing.

SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise

cs.IR · 2026-02-13 · unverdicted · novelty 7.0

SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.

FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection

cs.SD · 2026-04-02 · unverdicted · novelty 6.0

FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.

Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction

cs.CL · 2025-02-17 · unverdicted · novelty 6.0

Step-Audio introduces a 130B-parameter unified speech-text model with open-sourced components for understanding, generation, affordable voice cloning, and dynamic control, claiming SOTA human evaluation results on a new benchmark.

Speech-based Psychological Crisis Assessment using LLMs

cs.CL · 2026-05-11 · unverdicted · novelty 5.0

LLM system with paralinguistic cue injection and auxiliary reasoning training reaches 0.802 macro F1 and 0.805 accuracy on three-class speech-based crisis level classification under 5-fold cross-validation.

Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge

eess.AS · 2026-04-23 · unverdicted · novelty 5.0

A new HumDial-FDBench benchmark and real human-recorded dual-channel dataset are released to assess full-duplex dialogue systems on interruptions and conversational flow.

End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering

cs.SD · 2025-11-12 · unverdicted · novelty 5.0

CLSR is an end-to-end contrastive language-speech retriever using an intermediate text-like conversion step to improve retrieval of relevant segments from long audio for spoken question answering.

Kimi-Audio Technical Report

eess.AS · 2025-04-25 · unverdicted · novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.

PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory

cs.AI · 2026-04-09 · unverdicted · novelty 4.0

PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while finding deeper intents.

Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS

cs.SD · 2024-09-27 · unverdicted · novelty 3.0

A two-stage static-then-dynamic prompt selection strategy using prosodic features, LLM coherence scores, and similarity metrics improves emotion intensity and speaker consistency in zero-shot TTS.

citing papers explorer

Showing 11 of 11 citing papers.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing cs.CL · 2026-05-07 · unverdicted · none · ref 16
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
OmniTrace: A Unified Framework for Generation-Time Attribution in Omni-Modal LLMs cs.CL · 2026-03-20 · unverdicted · none · ref 27
OmniTrace converts token-level signals into span-level cross-modal attributions for open-ended generation in omni-modal LLMs via generation-time tracing.
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise cs.IR · 2026-02-13 · unverdicted · none · ref 11
SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.
FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection cs.SD · 2026-04-02 · unverdicted · none · ref 34
FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.
Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction cs.CL · 2025-02-17 · unverdicted · none · ref 46
Step-Audio introduces a 130B-parameter unified speech-text model with open-sourced components for understanding, generation, affordable voice cloning, and dynamic control, claiming SOTA human evaluation results on a new benchmark.
Speech-based Psychological Crisis Assessment using LLMs cs.CL · 2026-05-11 · unverdicted · none · ref 27
LLM system with paralinguistic cue injection and auxiliary reasoning training reaches 0.802 macro F1 and 0.805 accuracy on three-class speech-based crisis level classification under 5-fold cross-validation.
Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge eess.AS · 2026-04-23 · unverdicted · none · ref 22
A new HumDial-FDBench benchmark and real human-recorded dual-channel dataset are released to assess full-duplex dialogue systems on interruptions and conversational flow.
End-to-end Contrastive Language-Speech Pretraining Model For Long-form Spoken Question Answering cs.SD · 2025-11-12 · unverdicted · none · ref 13
CLSR is an end-to-end contrastive language-speech retriever using an intermediate text-like conversion step to improve retrieval of relevant segments from long audio for spoken question answering.
Kimi-Audio Technical Report eess.AS · 2025-04-25 · unverdicted · none · ref 20
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory cs.AI · 2026-04-09 · unverdicted · none · ref 7
PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while finding deeper intents.
Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS cs.SD · 2024-09-27 · unverdicted · none · ref 33
A two-stage static-then-dynamic prompt selection strategy using prosodic features, LLM coherence scores, and similarity metrics improves emotion intensity and speaker consistency in zero-shot TTS.

Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer