Balancing ASR and diarization in end-to-end LLMs for multi-talker speech recognition

2026 · eess.AS · arXiv 2606.13095

1 Pith paper cites this work. Polarity classification is still indexing.



abstract

Multi-talker speech recognition is often addressed by combining automatic speech recognition (ASR) and speaker diarization in a pipeline system. Recently, LLM-based approaches have shown promise by jointly modeling semantic and speaker information, but they typically require large-scale multi-talker corpora that are costly to annotate. In this paper, we investigate how to efficiently train an LLM-based system with limited real-recorded data while maintaining high accuracy in speaker attribution. We propose several strategies: (1) a dual-encoder architecture to extract semantic and speaker features, (2) a feature interleaving format to merge these features as the inputs to the LLM, (3) a length-aware speaker ID loss to enhance diarization capability, and (4) an adaptive threshold strategy for ASR loss computation to mitigate hallucinations caused by speech overlaps. These strategies balance training between ASR and diarization tasks. Our system outperforms open-source baseline approaches, achieving relative improvements of 18% on the AliMeeting corpus and 24% on the Aishell4 corpus.
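The abstract does not say how the feature interleaving format in strategy (2) is laid out. As a minimal sketch, assuming the two encoders emit the same number of frames and that "interleaving" means alternating one semantic frame with one speaker frame, the merge could look like the following. The function name `interleave_features` and the toy frame values are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Frame = List[float]


def interleave_features(semantic: Sequence[Frame], speaker: Sequence[Frame]) -> List[Frame]:
    """Alternate frames from two streams: [sem_0, spk_0, sem_1, spk_1, ...].

    Assumption (not from the paper): both encoders produce one frame per
    time step, so the two streams have equal length.
    """
    if len(semantic) != len(speaker):
        raise ValueError("both streams must have the same number of frames")
    merged: List[Frame] = []
    for sem_frame, spk_frame in zip(semantic, speaker):
        merged.append(list(sem_frame))  # semantic (ASR) frame for this step
        merged.append(list(spk_frame))  # speaker frame for the same step
    return merged


# Toy input: two time steps, two-dimensional features (illustrative values only).
semantic = [[0.1, 0.2], [0.3, 0.4]]
speaker = [[1.0, 0.0], [0.0, 1.0]]
merged = interleave_features(semantic, speaker)
```

Under this assumed layout the LLM input is twice as long as either stream, with each semantic frame immediately followed by the speaker frame for the same time step.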

representative citing papers

Balancing ASR and diarization in end-to-end LLMs for multi-talker speech recognition

eess.AS · 2026-06-11 · unverdicted · novelty 4.0

LLM-based multi-talker ASR with a dual-encoder architecture, feature interleaving, a length-aware speaker ID loss, and an adaptive ASR loss threshold achieves relative improvements of 18% on AliMeeting and 24% on Aishell4 over open-source baselines.

citing papers explorer

Showing 1 of 1 citing paper.

Balancing ASR and diarization in end-to-end LLMs for multi-talker speech recognition

eess.AS · 2026-06-11 · unverdicted · none · ref 2 · internal anchor

LLM-based multi-talker ASR with dual-encoder, feature interleaving, length-aware speaker loss, and adaptive ASR threshold achieves 18% and 24% relative gains over baselines on AliMeeting and Aishell4.
