Pith Number

pith:KCXG5LP5

pith:2026:KCXG5LP5BUZSBAOZEMVTN3Z5PU
not attested · not anchored · not stored · refs resolved

CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR

Chikara Maeda, Chyi-Jiunn Lin, Muhammad Shakeel, Shinji Watanabe, Yosuke Fukumoto

CALM integrates speaker embeddings for target extraction with dynamic vocabulary biasing to halve biased error rates in overlapping multi-speaker ASR.

arxiv:2601.22792 v2 · 2026-01-30 · eess.AS · cs.CL · cs.SD

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{KCXG5LP5BUZSBAOZEMVTN3Z5PU}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

On two-speaker mixtures, CALM reduces biased word error rate (B-WER) from 12.7 to 4.7 on LibriSpeech2Mix and biased character error rate (B-CER) from 16.6 to 8.4 on CSJMix2 (eval3), demonstrating the effectiveness of joint acoustic-linguistic modeling across languages.

C2 · weakest assumption

That the simulated two-speaker mixtures (LibriSpeechMix, CSJMix) and the IHM-mix condition of AMI sufficiently represent the acoustic and linguistic statistics of real overlapping conversations where speaker turns, noise, and context vary more widely.

C3 · one-line summary

CALM jointly models acoustic speaker identity and linguistic context, cutting biased error rates by roughly half or more on two-speaker English and Japanese mixtures.
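The size of the improvement in C1 can be checked with a short calculation. The sketch below uses only the four figures quoted in C1; the helper name `relative_reduction` is illustrative and not taken from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative error-rate reduction, in percent."""
    return (before - after) / before * 100.0


# (before, after) pairs exactly as quoted in claim C1.
results = {
    "LibriSpeech2Mix B-WER": (12.7, 4.7),
    "CSJMix2 (eval3) B-CER": (16.6, 8.4),
}

reductions = {
    name: round(relative_reduction(before, after), 1)
    for name, (before, after) in results.items()
}

for name, pct in reductions.items():
    print(f"{name}: {pct}% relative reduction")
# LibriSpeech2Mix B-WER: 63.0% relative reduction
# CSJMix2 (eval3) B-CER: 49.4% relative reduction
```

The English result is well past a halving, while the Japanese result lands just under one.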

References

53 extracted · 53 resolved · 1 Pith anchor

[1] INTRODUCTION Single-speaker automatic speech recognition (ASR) systems have achieved state-of-the-art (SOTA) performance across many speech-processing tasks [1–3]. However, in multi-speaker settings
[2] CALM: Joint Contextual Acoustic-Linguistic Modeling for Personalization of Multi-Speaker ASR 2026 · arXiv:2601.22792
[3] Frame-level target-speaker activity posteriors are computed as: $P_{\mathrm{vad}} = \sigma\left(W_{\mathrm{vad}} \hat{H}^{(L)} + b_{\mathrm{vad}}\right)$ (11), with $P_{\mathrm{vad}} \in [0,1]^{T_{\mathrm{enc}}}$
[4] EXPERIMENTS The CALM framework is built on ESPnet [45], pairing a Conformer encoder with a Transformer decoder. The Conformer has 12 layers with 4 heads and 1024 linear units (kernel size 31) and ap 2048
[5] However, unlike in simulated conditions, overall WER increases from 37.4 to 39.1 absolute points. Our error analysis indicates that this degradation is primarily driven by an increase in inser- ti
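The frame-level posterior in excerpt [3] is a sigmoid applied to an affine projection of each encoder frame. A minimal sketch follows, under assumed shapes: the hidden states are taken to be T_enc frames of dimension D, the weight a D-dimensional vector, and the bias a scalar. These shapes, the function names, and the toy values are assumptions for illustration, not details from the paper.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def vad_posteriors(hidden, weight, bias):
    """Return one activity probability per encoder frame.

    hidden: list of T_enc frames, each a list of D floats (assumed layout)
    weight: list of D floats (assumed shape of W_vad)
    bias:   float (assumed shape of b_vad)
    """
    return [
        sigmoid(sum(w * h for w, h in zip(weight, frame)) + bias)
        for frame in hidden
    ]


# Toy values: three frames with D = 3.
hidden = [[0.5, -1.0, 2.0], [0.0, 0.0, 0.0], [-3.0, 1.0, -2.0]]
weight = [1.0, 0.5, 1.5]
bias = 0.0

p_vad = vad_posteriors(hidden, weight, bias)
print(p_vad)
```

Each entry lies in [0, 1], which matches the stated range of the posterior vector.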

Formal links

1 machine-checked theorem link

Cited by

1 paper in Pith

Receipt and verification
First computed 2026-05-18T02:45:05.698741Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

50ae6eadfd0d332081d9232b36ef3d7d15a57379dc8f8b4303a5fe05bf752bb9

Aliases

arxiv: 2601.22792 · arxiv_version: 2601.22792v2 · doi: 10.48550/arxiv.2601.22792 · pith_short_12: KCXG5LP5BUZS · pith_short_16: KCXG5LP5BUZSBAOZ · pith_short_8: KCXG5LP5
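For this record, the identifiers line up in a simple way: the Pith Number equals the unpadded base32 encoding of the first 16 bytes of the canonical hash, and the short aliases are its 8-, 12- and 16-character prefixes. The sketch below reproduces that relationship. It is an observation about this one record, not a documented derivation rule, and the variable names are illustrative.

```python
import base64

# Canonical hash as listed on this page.
canonical_hash = "50ae6eadfd0d332081d9232b36ef3d7d15a57379dc8f8b4303a5fe05bf752bb9"

# Assumption: take the first 16 bytes, base32-encode, drop '=' padding.
first_16_bytes = bytes.fromhex(canonical_hash)[:16]
pith_number = base64.b32encode(first_16_bytes).decode("ascii").rstrip("=")

# The short aliases appear to be plain prefixes of the full number.
aliases = {length: pith_number[:length] for length in (8, 12, 16)}

print(pith_number)  # KCXG5LP5BUZSBAOZEMVTN3Z5PU
print(aliases)
```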
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/KCXG5LP5BUZSBAOZEMVTN3Z5PU \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 50ae6eadfd0d332081d9232b36ef3d7d15a57379dc8f8b4303a5fe05bf752bb9
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "cd2790a0db990c662e3544ea8ad77ba01427a8f6d561f01a89a1d7d1454cfb20",
    "cross_cats_sorted": [
      "cs.CL",
      "cs.SD"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "eess.AS",
    "submitted_at": "2026-01-30T10:12:16Z",
    "title_canon_sha256": "539e452adb9e8cad4111a9bbe956f431d5145aca67b5f57c3d0557c80e80ca06"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2601.22792",
    "kind": "arxiv",
    "version": 2
  }
}
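The hashing step of the verification command can also be run offline against the record shown above. The sketch below mirrors the serialization options used in that command (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes). The function name `canonical_sha256` is illustrative, and the sketch does not assert that the record as rendered on this page matches the API response byte for byte.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record with the same JSON options as the curl pipeline."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode()).hexdigest()


# Canonical record copied from the JSON shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "cd2790a0db990c662e3544ea8ad77ba01427a8f6d561f01a89a1d7d1454cfb20",
        "cross_cats_sorted": ["cs.CL", "cs.SD"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "eess.AS",
        "submitted_at": "2026-01-30T10:12:16Z",
        "title_canon_sha256": "539e452adb9e8cad4111a9bbe956f431d5145aca67b5f57c3d0557c80e80ca06",
    },
    "schema_version": "1.0",
    "source": {"id": "2601.22792", "kind": "arxiv", "version": 2},
}

digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash listed on this page
```

Because keys are sorted before hashing, the digest does not depend on the order in which the fields were written, which is what lets a mirror recompute it independently.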