SpeakerLM: End-to-end versatile speaker diarization and recognition with multimodal large language models

· 2025 · arXiv 2508.06372

7 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

eess.AS · 2026-04-03 · unverdicted · novelty 7.0

Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware cache for long audio.

MSU-Bench: Towards Speaker-Centric Understanding in Conversational Multi-Speaker Scenarios

eess.AS · 2026-06-22 · unverdicted · novelty 6.0

MSU-Bench is a new two-tier benchmark covering speaker grounding to dialogue reasoning in multi-speaker conversations, with Gemini-assisted annotation and human verification.

DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models

eess.AS · 2026-04-24 · unverdicted · novelty 6.0

DM-ASR reformulates multi-speaker ASR as multi-turn dialogue generation conditioned on diarization results, achieving competitive benchmark performance with relatively small models and limited data.

Still Between Us? Evaluating and Improving Voice Assistant Robustness to Third-Party Interruptions

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

Introduces the TPI-Train dataset and TPI-Bench to mitigate semantic shortcut learning in SLMs by enforcing acoustic cue prioritization for third-party interruption handling.

Balancing ASR and diarization in end-to-end LLMs for multi-talker speech recognition

eess.AS · 2026-06-11 · unverdicted · novelty 4.0

An LLM-based multi-talker ASR system with a dual encoder, feature interleaving, a length-aware speaker loss, and an adaptive ASR threshold achieves 18% and 24% relative gains over baselines on AliMeeting and AISHELL-4.

SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription

eess.AS · 2026-06-01 · unverdicted · novelty 4.0

SoulX-Transcriber is a unified LLM framework for end-to-end multi-speaker transcription using two-stage training (speaker-aware pre-training then supervised fine-tuning) that reports strong results on AliMeeting, AISHELL-4, and AMI.

PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory

cs.AI · 2026-04-09 · unverdicted · novelty 4.0

PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while finding deeper intents.

citing papers explorer


Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR · eess.AS · 2026-04-03 · unverdicted · none · ref 6
MSU-Bench: Towards Speaker-Centric Understanding in Conversational Multi-Speaker Scenarios · eess.AS · 2026-06-22 · unverdicted · none · ref 14
DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models · eess.AS · 2026-04-24 · unverdicted · none · ref 61
Still Between Us? Evaluating and Improving Voice Assistant Robustness to Third-Party Interruptions · cs.CL · 2026-04-19 · unverdicted · none · ref 2
Balancing ASR and diarization in end-to-end LLMs for multi-talker speech recognition · eess.AS · 2026-06-11 · unverdicted · none · ref 22
SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription · eess.AS · 2026-06-01 · unverdicted · none · ref 8
PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory · cs.AI · 2026-04-09 · unverdicted · none · ref 29
