pith. machine review for the scientific record.

arxiv: 2605.15044 · v1 · submitted 2026-05-14 · 💻 cs.SD · cs.AI · cs.LG · cs.MM · eess.AS

Recognition: no theorem link

SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 03:16 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · cs.LG · cs.MM · eess.AS
keywords speaker verification · audio LLM · speaker profiling · verification reasoning · hierarchical tokenizer · natural language interface · recording conditions · speaker understanding

The pith

SpeakerLLM unifies speaker profiling, recording-condition analysis, utterance comparison and evidence-organized verification reasoning inside a natural-language audio-LLM interface.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SpeakerLLM to give audio large language models the ability to handle speaker-specific tasks that current systems address only with scalar scores or short labels. It combines single-utterance profiling, recording-condition understanding, pair-wise comparison, and step-by-step verification reasoning so that the model can output linguistic traces instead of isolated decisions. A reader would care because audio-first agents in robots, wearables, and conversational systems need to authorize users and adapt interactions while also explaining why a voice matches or does not match. The method relies on a hierarchical speaker tokenizer that processes both coarse identity cues and fine acoustic details, together with specially constructed reasoning targets that keep profile evidence separate from the final verdict.

Core claim

SpeakerLLM is a speaker-specialized audio-LLM that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. Its core component is a hierarchical speaker tokenizer that uses utterance-level embeddings to summarize identity and profile cues while retaining frame-level features for fine-grained acoustic descriptors. The framework also supplies verification-reasoning targets and a decision-composition policy that separate profile-level evidence from the final same-or-different decision and assemble them into structured traces.
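
The abstract does not spell out the trace format. The sketch below is one guess at how a decision-composition policy could assemble recording condition, profile evidence, and verdict into a single trace; every field name and phrase is invented for illustration rather than taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class VerificationTrace:
    """Hypothetical evidence-organized trace; all field names are illustrative."""
    recording_condition: str     # e.g. "both clips are clean, close-microphone recordings"
    profile_evidence: list[str]  # per-attribute comparisons (pitch range, accent, timbre, ...)
    profile_support: str         # how strongly the profiles agree, e.g. "strong" or "weak"
    verdict: str                 # "same speaker" or "different speakers"

    def compose(self) -> str:
        # State the recording condition and profile evidence before the verdict,
        # so the final decision can be audited against the evidence that precedes it.
        evidence = "; ".join(self.profile_evidence)
        return (f"Recording conditions: {self.recording_condition}. "
                f"Profile evidence: {evidence}. "
                f"Support: {self.profile_support}. "
                f"Verdict: {self.verdict}.")
```

The point of such a layout is that the same-or-different call can be checked against the evidence fields rather than read as an opaque similarity score.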

What carries the argument

The hierarchical speaker tokenizer, which captures multiple granularities of speaker evidence by combining utterance-level embeddings for identity and profile cues with frame-level features for acoustic details.
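
The material above does not specify how the two granularities are fused. A minimal PyTorch sketch of one plausible arrangement, assuming a frozen speaker encoder and a linear projection of both granularities into the LLM's embedding space as soft prompt tokens, is given below; the dimensions, the pooling step, and the class name are all assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HierarchicalSpeakerTokenizer(nn.Module):
    """Sketch of a two-granularity speaker tokenizer; all sizes are assumptions."""
    def __init__(self, spk_dim=192, frame_dim=512, llm_dim=2048, n_frame_tokens=16):
        super().__init__()
        self.utt_proj = nn.Linear(spk_dim, llm_dim)        # coarse: one identity/profile token
        self.frame_proj = nn.Linear(frame_dim, llm_dim)    # fine: acoustic-detail tokens
        self.pool = nn.AdaptiveAvgPool1d(n_frame_tokens)   # compress T frames to a fixed token budget

    def forward(self, utt_emb, frame_feats):
        # utt_emb: (B, spk_dim) utterance-level embedding from a frozen speaker encoder;
        # frame_feats: (B, T, frame_dim) frame-level features from the same encoder.
        utt_tok = self.utt_proj(utt_emb).unsqueeze(1)                     # (B, 1, llm_dim)
        pooled = self.pool(frame_feats.transpose(1, 2)).transpose(1, 2)  # (B, n_frame_tokens, frame_dim)
        frame_toks = self.frame_proj(pooled)                              # (B, n_frame_tokens, llm_dim)
        # Soft speaker tokens to prepend to the LLM's text-token embeddings.
        return torch.cat([utt_tok, frame_toks], dim=1)                    # (B, 1 + n_frame_tokens, llm_dim)
```

Under this reading, the single utterance-level token carries identity and profile cues, while the pooled frame-level tokens preserve the finer acoustic descriptors that the reasoning traces refer to.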

If this is right

  • SpeakerLLM-Base improves speaker-profile and recording-condition understanding over general audio-LLMs.
  • SpeakerLLM-VR preserves strong generated-verdict accuracy while outputting decision traces grounded in the supervised verification reasoning schema.
  • The natural-language interface supports complex speaker tasks that combine profiling, comparison, and explicit reasoning in one model.
  • The released metadata-enriched supervision dataset and target-construction code allow other researchers to reproduce and extend the same training regime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the traces remain reliable in deployment, conversational agents could explain speaker-verification outcomes to users in ordinary language rather than numeric scores.
  • Recording-condition awareness built into the same model could improve personalization for screenless devices where acoustic context changes frequently.
  • Releasing the dataset may encourage work on grounded reasoning for other audio attributes beyond speaker identity.

Load-bearing premise

The hierarchical speaker tokenizer captures multiple granularities of speaker evidence effectively enough to support both profiling and verification reasoning.

What would settle it

A test set of utterance pairs with known same-speaker or different-speaker labels, used to check whether the model's generated reasoning traces contradict the acoustic evidence or produce incorrect final verdicts.
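
A minimal version of that check could score, over labelled trials, how often a generated trace's verdict disagrees with ground truth and how often the verdict is not backed by the evidence stated in the same trace. The parsing below assumes a hypothetical trace phrasing, not the paper's supervised schema.

```python
import re

def score_trace_consistency(traces, labels):
    """Count verdict errors and evidence/verdict contradictions over labelled trials.

    traces: list of generated reasoning strings; labels: list of bools
    (True = same speaker). The regexes assume a hypothetical phrasing of
    the verdict and support level, not the paper's actual schema.
    """
    wrong_verdict = contradiction = 0
    for trace, same in zip(traces, labels):
        verdict_same = bool(re.search(r"\bsame speaker\b", trace, re.I))
        strong_support = bool(re.search(r"\bstrong(ly)? (support|match)", trace, re.I))
        if verdict_same != same:
            wrong_verdict += 1      # final decision disagrees with the ground-truth label
        if verdict_same and not strong_support:
            contradiction += 1      # "same speaker" verdict not backed by the stated evidence
    n = max(len(labels), 1)
    return {"verdict_error_rate": wrong_verdict / n,
            "trace_contradiction_rate": contradiction / n}

# Example: two trials, one labelled same-speaker and one different-speaker.
print(score_trace_consistency(
    ["Pitch range and accent strongly match; verdict: same speaker.",
     "Timbre differs markedly; verdict: different speakers."],
    [True, False]))
```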

Figures

Figures reproduced from arXiv: 2605.15044 by Ha-Jin Yu, Joon Son Chung, Jungwoo Heo, KiHyun Nam, Siu Bae.

Figure 1. QA task inventory for SpeakerLLM training. Single-utterance tasks read speaker-profile …
Figure 2. Overview of SpeakerLLM. A frozen speaker encoder extracts a speaker embedding …
read the original abstract

As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This requires modeling who is speaking, how the voice sounds, and how recording conditions affect speaker cues. Conventional speaker verification systems provide strong scalar scores but little linguistic evidence, while current audio-LLMs and speaker-aware language models have limited ability to organize speaker information beyond binary labels or descriptive profiles. We present SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. We construct verification-reasoning targets and a decision-composition policy that separate profile-level evidence from the final same-or-different decision and organize recording condition, profile evidence, and the decision into a structured trace. At its core, SpeakerLLM uses a hierarchical speaker tokenizer designed to capture multiple granularities of speaker evidence. Utterance-level speaker embeddings summarize identity and profile-level cues, whereas frame-level speaker features preserve fine-grained acoustic descriptors. Experiments show that SpeakerLLM-Base improves speaker-profile and recording-condition understanding over general audio-LLMs, while SpeakerLLM-VR preserves strong generated-verdict accuracy and produces decision traces grounded in the supervised verification reasoning schema. We will release the metadata-enriched supervision dataset and target-construction code for reproducibility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript presents SpeakerLLM, a speaker-specialized audio-LLM framework that unifies single-utterance speaker profiling, recording-condition understanding, utterance-pair speaker comparison, and evidence-organized verification reasoning within a natural-language interface. It introduces a hierarchical speaker tokenizer combining utterance-level embeddings for identity/profile cues with frame-level features for fine-grained acoustics, along with constructed verification-reasoning targets and a decision-composition policy that separate profile evidence from the final same/different verdict to produce structured, schema-grounded traces. Experiments indicate that SpeakerLLM-Base improves on profile and condition tasks over general audio-LLMs, while SpeakerLLM-VR preserves verdict accuracy with interpretable decision traces.

Significance. If the reported gains and grounding properties hold, the work would be significant for audio-first agents in physical AI, conversational systems, and wearables by bridging the gap between scalar speaker verification and descriptive language models, enabling both performance improvements and evidence-based reasoning in a single model.

minor comments (2)
  1. The abstract asserts performance gains and grounded traces without quantitative metrics, baselines, or error bars; moving key numerical results (e.g., accuracy deltas, comparison to general audio-LLMs) into the abstract would improve immediate assessability while preserving the full evaluation details in the experiments section.
  2. Clarify the exact architecture of the hierarchical speaker tokenizer (e.g., how utterance-level and frame-level features are fused into the LLM input) with a diagram or pseudocode in the methods section to aid reproducibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of SpeakerLLM, the recognition of its potential significance for audio-first agents, and the two constructive minor comments. We have revised the manuscript accordingly.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces SpeakerLLM as a newly constructed framework using a hierarchical speaker tokenizer and verification-reasoning targets to unify profiling, condition understanding, comparison, and grounded traces. No equations, derivations, or load-bearing steps are presented that reduce results to fitted parameters defined by the same data or to self-citations. The abstract and description emphasize explicit construction of targets and policy, with experiments directly testing the unification claim on profile/condition tasks and verdict accuracy. The evaluation is grounded in external benchmarks, with no tautological reductions visible.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on the effectiveness of newly introduced components whose performance is asserted but not quantified in the provided abstract.

axioms (1)
  • standard math · Standard assumptions underlying audio-LLM training and speaker embedding extraction
    The framework builds on existing LLM and speaker-verification architectures without stating new mathematical axioms.
invented entities (2)
  • hierarchical speaker tokenizer · no independent evidence
    purpose: Capture utterance-level identity cues and frame-level acoustic descriptors simultaneously
    New component introduced to support multiple granularities of speaker evidence
  • verification-reasoning targets and decision-composition policy · no independent evidence
    purpose: Separate profile-level evidence from the final same-or-different decision and produce structured traces
    Constructed supervision schema introduced for the verification reasoning task

pith-pipeline@v0.9.0 · 5599 in / 1314 out tokens · 51538 ms · 2026-05-15T03:16:37.586105+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 7 internal anchors

  1. [1]

    A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024

  2. [2]

    Voice user interface: Literature review, challenges and future directions. System Theory, Control and Computing Journal, 1(2):65–89, 2021

    Francis Rakotomalala, Hasindraibe Niriarijaona Randriatsarafara, Aimé Richard Hajalalaina, and Ndaohialy Manda Vy Ravonimanantsoa. Voice user interface: Literature review, challenges and future directions. System Theory, Control and Computing Journal, 1(2):65–89, 2021

  3. [3]

    VisionClaw: Always-On AI Agents through Smart Glasses

    Xiaoan Liu, DaeHo Lee, Eric J Gonzalez, Mar Gonzalez-Franco, and Ryo Suzuki. Visionclaw: Always-on ai agents through smart glasses. arXiv preprint arXiv:2604.03486, 2026

  4. [4]

    SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities

    Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15757–15773, 2023

  5. [5]

    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919, 2023

  6. [6]

    SALMONN: Towards generic hearing abilities for large language models

    Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun MA, and Chao Zhang. SALMONN: Towards generic hearing abilities for large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=14rn7HpKVk

  7. [7]

    Audio flamingo 3: Advancing audio intelligence with fully open large audio language models

    Sreyan Ghosh, Arushi Goel, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, and Bryan Catanzaro. Audio flamingo 3: Advancing audio intelligence with fully open large audio language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL htt...

  8. [8]

    Front-end factor analysis for speaker verification

    Najim Dehak, Patrick J. Kenny, Réda Dehak, Pierre Dumouchel, and Pierre Ouellet. Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4):788–798, 2011. doi: 10.1109/TASL.2010.2064307

  9. [9]

    X-Vectors: Robust DNN embeddings for speaker recognition

    David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. X-Vectors: Robust DNN embeddings for speaker recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5329–5333, 2018. doi: 10.1109/ICASSP.2018.8461375

  10. [10]

    Generalized end-to-end loss for speaker verification

    Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. Generalized end-to-end loss for speaker verification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4879–4883, 2018. doi: 10.1109/ICASSP.2018.8462665

  11. [11]

    Explanation in artificial intelligence: Insights from the social sciences. Artificial intelligence, 267:1–38, 2019

    Tim Miller. Explanation in artificial intelligence: Insights from the social sciences. Artificial intelligence, 267:1–38, 2019

  12. [12]

    Questioning the ai: informing design practices for explainable ai user experiences

    Q Vera Liao, Daniel Gruen, and Sarah Miller. Questioning the ai: informing design practices for explainable ai user experiences. In Proceedings of the 2020 CHI conference on human factors in computing systems, pages 1–15, 2020

  13. [13]

    ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification

    Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Interspeech 2020, pages 3830–3834, 2020. doi: 10.21437/Interspeech.2020-2650

  14. [14]

    Explainable attribute-based speaker verification. arXiv preprint arXiv:2405.19796, 2024

    Xiaoliang Wu, Chau Luu, Peter Bell, and Ajitha Rajan. Explainable attribute-based speaker verification. arXiv preprint arXiv:2405.19796, 2024

  15. [15]

    Concept bottleneck models

    Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang. Concept bottleneck models. In International conference on machine learning, pages 5338–5348. PMLR, 2020

  16. [16]

    Vo-Ve: An Explainable Voice-Vector for Speaker Identity Evaluation

    Jaejun Lee and Kyogu Lee. Vo-Ve: An Explainable Voice-Vector for Speaker Identity Evaluation. In Interspeech 2025, pages 3988–3992, 2025. doi: 10.21437/Interspeech.2025-591

  17. [17]

    Speaker verification in agent-generated conversations

    Yizhe Yang, Palakorn Achananuparp, He-Yan Huang, Jing Jiang, and Ee-Peng Lim. Speaker verification in agent-generated conversations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5655–5676, 2024

  18. [18]

    SpeakerLM: End-to-end versatile speaker diarization and recognition with multimodal large language models

    Han Yin, Yafeng Chen, Chong Deng, Luyao Cheng, Hui Wang, Chao-Hong Tan, Qian Chen, Wen Wang, and Xiangang Li. SpeakerLM: End-to-end versatile speaker diarization and recognition with multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 34467–34475, 2026

  19. [19]

    Colmbo: Speaker language model for descriptive profiling. arXiv preprint arXiv:2506.09375, 2025

    Massa Baali, Shuo Han, Syed Abdul Hannan, Purusottam Samal, Karanveer Singh, Soham Deshmukh, Rita Singh, and Bhiksha Raj. Colmbo: Speaker language model for descriptive profiling. arXiv preprint arXiv:2506.09375, 2025

  20. [20]

    Forensic speaker identification

    Phil Rose. Forensic speaker identification. CRC Press, 2002

  21. [21]

    Chain of thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed H. Chi, Quoc V Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?...

  22. [22]

    Measuring Faithfulness in Chain-of-Thought Reasoning

    Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, Evan Hubinger, Jackson Kernion, et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023

  23. [23]

    Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36:74952–74965, 2023

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36:74952–74965, 2023

  24. [24]

    VoxCeleb: A Large-Scale Speaker Identification Dataset

    Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. VoxCeleb: A Large-Scale Speaker Identification Dataset. In Interspeech 2017, pages 2616–2620, 2017. doi: 10.21437/Interspeech.2017-950

  25. [25]

    VoxCeleb2: Deep Speaker Recognition

    Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. VoxCeleb2: Deep Speaker Recognition. In Interspeech 2018, pages 1086–1090, 2018. doi: 10.21437/Interspeech.2018-1929

  26. [26]

    LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus

    Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang, Wei Han, and Ankur Bapna. LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus. In Interspeech 2023, pages 5496–5500, 2023. doi: 10.21437/Interspeech.2023-1584

  27. [27]

    VoxCeleb enrichment for age and gender recognition

    Khaled Hechmi, Trung Ngo Trong, Ville Hautamäki, and Tomi Kinnunen. VoxCeleb enrichment for age and gender recognition. In 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 687–693. IEEE, 2021

  28. [28]

    LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning

    Masaya Kawamura, Ryuichi Yamamoto, Yuma Shirahata, Takuya Hasumi, and Kentaro Tachibana. LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning. In Interspeech 2024, pages 1850–1854, 2024. doi: 10.21437/Interspeech.2024-692

  29. [29]

    MUSAN: A Music, Speech, and Noise Corpus

    David Snyder, Guoguo Chen, and Daniel Povey. Musan: A music, speech, and noise corpus. arXiv preprint arXiv:1510.08484, 2015

  30. [30]

    A study on data augmentation of reverberant speech for robust speech recognition

    Tom Ko, Vijayaditya Peddinti, Daniel Povey, Michael L. Seltzer, and Sanjeev Khudanpur. A study on data augmentation of reverberant speech for robust speech recognition. In 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5220–5224, 2017. doi: 10.1109/ICASSP.2017.7953152

  31. [31]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023

  32. [32]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9

  33. [33]

    Reshape Dimensions Network for Speaker Recognition

    Ivan Yakovlev, Rostislav Makarov, Andrei Balykin, Pavel Malov, Anton Okhotnikov, and Nikita Torgashov. Reshape Dimensions Network for Speaker Recognition. In Interspeech 2024, pages 3235–3239, 2024. doi: 10.21437/Interspeech.2024-2116

  34. [34]

    Qwen2.5 Technical Report

    Qwen, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li, Ti...

  35. [35]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7

  36. [36]

    Flashattention-2: Faster attention with better parallelism and work partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mZn2Xyh9Ec

  37. [37]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025

  38. [38]

    Speaker verification with speech-aware llms: Evaluation and augmentation. arXiv preprint arXiv:2603.10827, 2026

    Thomas Thebaud, Yuzhe Wang, Laureano Moro-Velazquez, Jesus Villalba-Lopez, and Najim Dehak. Speaker verification with speech-aware llms: Evaluation and augmentation. arXiv preprint arXiv:2603.10827, 2026

  39. [39]

    TinyLlama: An Open-Source Small Language Model

    Peiyuan Zhang, Guangtao Zeng, Tianduo Wang, and Wei Lu. TinyLlama: An open-source small language model. arXiv preprint arXiv:2401.02385, 2024

  40. [40]

    Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

  41. [41]

    New method of measuring reverberation time

    Manfred R. Schroeder. New method of measuring reverberation time. The Journal of the Acoustical Society of America, 37(3):409–412, 1965. doi: 10.1121/1.1909343

  42. [42]

    Many attributes are similar. However, the latent speaker-identity cues show stronger separation. ... determined to be from different speakers

    Connector: selected by the alignment between the profile-support level and the ground-truth same/different label. 4. Verification verdict: selected by the same alignment. The full DECISION is their concatenation: DECISION = env clause (pair severity) + profile summary + connector + verification verdict (profile support × GT label). This construction separates prof...