arxiv: 2605.16026 · v1 · pith:GENG5NZN · submitted 2026-05-15 · 💻 cs.CL · cs.AI

From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation

Pith reviewed 2026-05-20 19:21 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords speech-to-speech translation · typological priors · multilingual S2ST · language conditioning · data efficiency · SpeechLLM · CVSS-C · Dual-CTC

The pith

Replacing flat language labels with structured typological priors improves multilingual speech-to-speech translation performance and data efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing compositional S2ST systems built on SpeechLLMs typically encode source languages as independent flat embeddings, which ignores shared linguistic structure across languages and limits adaptation when supervised data is scarce. The paper proposes S2ST-Omni 2 to reformulate language conditioning by drawing on typological priors from linguistic databases instead. This reformulation is applied at three levels: hierarchical encoding of source languages, dynamically gated language-aware Dual-CTC for acoustic modulation, and typology-aware prompting for the decoder. Experiments on CVSS-C demonstrate higher average scores across BLEU, COMET, ASR-BLEU, and BLASER 2.0, with ablations showing the three strategies complement one another and a low-resource Japanese-to-English test using roughly 3 hours of data confirming improved efficiency. A sympathetic reader would care because the approach supplies explicit inductive biases that could make multilingual translation systems require far less paired speech data for new language pairs.
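
To make the contrast between flat labels and structured conditioning concrete, here is a minimal sketch. It is not the paper's encoder: the attribute table, the multi-hot layout, and the names `TYPOLOGY`, `flat_encode`, and `hierarchical_encode` are all assumptions chosen for illustration. It only shows that languages sharing typological attributes end up with similar vectors, whereas flat labels keep every pair orthogonal.

```python
from math import sqrt

# Hypothetical, hand-written attribute table (not taken from the paper).
TYPOLOGY = {
    "fr": {"family": "Romance", "word_order": "SVO"},
    "es": {"family": "Romance", "word_order": "SVO"},
    "ja": {"family": "Japonic", "word_order": "SOV"},
}

# One axis per (attribute, value) pair, followed by one axis per language.
FEATURE_AXES = sorted({(k, v) for attrs in TYPOLOGY.values() for k, v in attrs.items()})
LANG_AXES = sorted(TYPOLOGY)


def flat_encode(lang):
    """Language-as-label baseline: a one-hot vector with no shared structure."""
    return [1.0 if lang == other else 0.0 for other in LANG_AXES]


def hierarchical_encode(lang, residual_weight=0.5):
    """Shared typological axes plus a down-weighted language-specific residual."""
    attrs = TYPOLOGY[lang]
    shared = [1.0 if attrs.get(k) == v else 0.0 for k, v in FEATURE_AXES]
    residual = [residual_weight * x for x in flat_encode(lang)]
    return shared + residual


def cosine(a, b):
    """Cosine similarity between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))


if __name__ == "__main__":
    fr, es, ja = (hierarchical_encode(code) for code in ("fr", "es", "ja"))
    print("structured fr~es:", cosine(fr, es))
    print("structured fr~ja:", cosine(fr, ja))
    print("flat fr~es:", cosine(flat_encode("fr"), flat_encode("es")))
```

In a trained system the axes would be learned embeddings rather than fixed indicator values; the toy version keeps the similarity structure easy to inspect.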

Core claim

Under the adopted evaluation protocol on CVSS-C, S2ST-Omni 2 achieves superior average performance among representative S2ST approaches across BLEU, COMET, ASR-BLEU, and BLASER 2.0. It does so by systematically replacing flat language labels with structured typological priors at three points: typology-informed hierarchical language encoding, dynamically-gated language-aware Dual-CTC, and typology-aware LLM prompting. Ablation studies confirm complementary benefits from the three strategies, and controlled analyses indicate improved data efficiency in low-resource many-to-one settings.

What carries the argument

Typology-informed hierarchical language encoding, dynamically-gated language-aware Dual-CTC, and typology-aware LLM prompting as structured replacements for flat language embeddings in a many-to-one compositional S2ST framework.
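
The phrase "dynamically-gated" can be illustrated with a generic content-dependent gate. The sketch below is an assumption rather than the paper's Dual-CTC module, whose equations are not given here: it applies a FiLM-style scale and shift (which in practice would be derived from a language embedding) and blends the result with the original frame through a sigmoid gate computed from the frame itself. The function and parameter names are hypothetical.

```python
from math import exp


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + exp(-x))


def gated_modulation(frame, scale, shift, gate_weights, gate_bias=0.0):
    """Blend an acoustic frame with its language-conditioned version.

    scale and shift stand in for language-dependent parameters; the gate
    depends on the frame content, so the conditioning strength varies per frame.
    Returns the blended frame and the gate value.
    """
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, frame)) + gate_bias)
    modulated = [s * x + b for s, x, b in zip(scale, frame, shift)]
    blended = [(1.0 - gate) * x + gate * m for x, m in zip(frame, modulated)]
    return blended, gate


if __name__ == "__main__":
    out, gate = gated_modulation(
        frame=[1.0, 2.0],
        scale=[2.0, 2.0],
        shift=[0.0, 0.0],
        gate_weights=[0.0, 0.0],
    )
    print(out, gate)
```

A gate near 0 leaves the frame untouched and a gate near 1 applies the full language-conditioned transform, which is one plausible reading of "content-adaptive acoustic modulation".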

If this is right

  • The representation-level, acoustic-level, and decoding-level strategies provide complementary benefits that together raise average scores across BLEU, COMET, ASR-BLEU, and BLASER 2.0.
  • Explicit typological priors yield measurable gains in data efficiency, as shown by competitive results in a Japanese-to-English setup with approximately 3 hours of supervised data.
  • Many-to-one compositional S2ST benefits when source-language information is encoded through systematic linguistic structure rather than isolated labels.
  • Ablation results indicate that removing any one of the three typology-based components reduces overall performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same hierarchical-plus-gated conditioning pattern could be tested on other speech tasks such as multilingual recognition or voice conversion to check for similar efficiency gains.
  • Typological priors might reduce the amount of new paired data needed when adding languages that share structural traits with existing ones.
  • The approach highlights a route for incorporating external linguistic resources directly into neural conditioning without requiring full retraining for each new language pair.

Load-bearing premise

Typological features drawn from external linguistic databases supply reliable inductive biases that generalize to the acoustic and semantic demands of speech-to-speech translation rather than merely correlating with surface-level language identity.

What would settle it

An ablation that replaces the typological priors with random or language-identity-preserving but structurally shuffled features and measures whether the reported gains on CVSS-C and the 3-hour Japanese-to-English setup disappear.
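
One way to build the structurally shuffled control described above is to reassign typological vectors across languages so that every language still receives a fixed vector, but never its own. The sketch below is an assumed construction, not an experiment from the paper; `shuffled_control` and the feature dictionary are hypothetical.

```python
import random


def shuffled_control(features, seed=0):
    """Reassign feature vectors across languages so no language keeps its own.

    The set of vectors is preserved, so dimensionality and richness match the
    original conditioning signal; only the language-to-vector mapping changes.
    """
    langs = sorted(features)
    if len(langs) < 2:
        raise ValueError("need at least two languages to shuffle")
    rng = random.Random(seed)
    order = langs[:]
    rng.shuffle(order)
    # Rotating the shuffled order by one position guarantees that every
    # language is paired with a different donor language.
    donors = order[1:] + order[:1]
    mapping = dict(zip(order, donors))
    return {lang: features[mapping[lang]] for lang in langs}


if __name__ == "__main__":
    # Placeholder vectors for illustration only.
    toy = {"fr": (1, 0, 0), "es": (0, 1, 0), "ja": (0, 0, 1)}
    print(shuffled_control(toy, seed=0))
```

If two languages start with identical vectors, a language can still receive a copy of its original values from its donor, so a real control would need to check for that case.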

Figures

Figures reproduced from arXiv: 2605.16026 by Jianjun Zhao, Lei Ma, Liang Zhang, Xiongfei Wu, Yang Hou, Yu Pan, Yves Le Traon.

Figure 1: Overall architecture and two-stage training pipeline of S2ST-Omni 2. LA denotes language-aware, CE is cross-entropy, and src/tgt denote source/target.

Figure 2: BLEU under varying training data budgets for S2ST-Omni and S2ST-Omni 2. (a) Average BLEU, with relative gains computed over S2ST-Omni.

Original abstract

Compositional speech-to-speech translation (S2ST) systems built upon speech large language models (SpeechLLMs) have recently shown promising performance. However, existing S2ST systems often either neglect source-language information or encode it through a language-as-label paradigm, representing each source language as an independent flat embedding. Such a design overlooks systematic linguistic structure shared across languages, which may limit data-efficient multilingual adaptation when supervised S2ST data are scarce. To address this issue, we propose S2ST-Omni 2, a many-to-one compositional S2ST framework that systematically reformulates multilingual language conditioning from flat language labels to structured typological priors. Specifically, S2ST-Omni 2 revisits language conditioning at three levels: typology-informed hierarchical language encoding for structured source-language representation, dynamically-gated language-aware Dual-CTC for content-adaptive acoustic modulation, and typology-aware LLM prompting for decoder-side linguistic guidance. Experiments on CVSS-C show that S2ST-Omni 2 achieves superior average performance among representative S2ST approaches across BLEU, COMET, ASR-BLEU, and BLASER 2.0 under the adopted evaluation protocol. Ablation studies indicate that the proposed representation-level, acoustic-level, and decoding-level strategies provide complementary benefits. Moreover, controlled data-budget analyses and a Japanese-to-English evaluation using only approximately 3 hours of supervised training data suggest that explicit typological priors provide useful inductive biases for data-efficient multilingual S2ST.
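
The abstract does not give the prompt wording used for typology-aware LLM prompting. As a rough sketch of the idea only, the hypothetical helper below renders a language's typological attributes into a decoder-side instruction; the template text and the attribute names are assumptions.

```python
def build_prompt(lang_name, attrs, target="English"):
    """Render typological attributes as a plain-text hint for the decoder LLM."""
    hints = "; ".join(
        f"{name.replace('_', ' ')}: {value}" for name, value in sorted(attrs.items())
    )
    return (
        f"Translate the following {lang_name} speech into {target}. "
        f"Source-language typology: {hints}."
    )


if __name__ == "__main__":
    print(build_prompt("Japanese", {"word_order": "SOV", "family": "Japonic"}))
```

Sorting the attributes keeps the prompt stable across runs, which matters if the same string is reused at training and inference time.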

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes S2ST-Omni 2, a many-to-one compositional speech-to-speech translation framework that replaces flat language labels with structured typological priors drawn from external linguistic databases. The method applies these priors at three levels: typology-informed hierarchical language encoding for source representation, dynamically-gated language-aware Dual-CTC for content-adaptive acoustic modulation, and typology-aware LLM prompting for decoder guidance. On the CVSS-C benchmark, S2ST-Omni 2 reports superior average performance across BLEU, COMET, ASR-BLEU, and BLASER 2.0 relative to representative S2ST baselines. Ablation studies indicate complementary benefits from the three conditioning strategies, while controlled data-budget experiments and a Japanese-to-English setup using approximately 3 hours of supervised data suggest gains in data efficiency.

Significance. If the central attribution to typological inductive biases holds, the work would offer a practical advance for data-efficient multilingual S2ST by leveraging cross-lingual linguistic structure rather than language-as-label embeddings. The multi-level integration and explicit use of external typological resources constitute a clear methodological contribution, and the low-resource Japanese-to-English result provides a concrete demonstration of potential utility. The paper already supplies ablation evidence of complementarity and benchmark comparisons; strengthening the isolation of typology from general conditioning richness would elevate its impact.

major comments (2)
  1. [Ablation studies] The reported complementarity among representation-level, acoustic-level, and decoding-level strategies does not include a control condition that supplies structured language embeddings of matched dimensionality and richness but without typological content (e.g., random or language-identity-only hierarchical vectors). Without this control, it remains unclear whether performance and data-efficiency gains derive from typological inductive biases or simply from richer, non-flat conditioning signals.
  2. [Experiments on CVSS-C] The claim of superior average performance across BLEU, COMET, ASR-BLEU, and BLASER 2.0 is presented without statistical significance tests, error bars, or exhaustive specification of baseline hyper-parameters and data splits. These omissions reduce confidence that the observed margins reliably exceed those attributable to implementation variance.
minor comments (2)
  1. [Abstract] The abstract and method sections would benefit from an explicit statement of which typological database (e.g., WALS) and feature subset are used, together with a brief justification for feature selection and weighting.
  2. Notation for the hierarchical encoder and gated Dual-CTC modules could be clarified by adding one or two equations that define the gating mechanism and the typology-informed embedding construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: Ablation studies: the reported complementarity among representation-level, acoustic-level, and decoding-level strategies does not include a control condition that supplies structured language embeddings of matched dimensionality and richness but without typological content (e.g., random or language-identity-only hierarchical vectors). Without this control, it remains unclear whether performance and data-efficiency gains derive from typological inductive biases or simply from richer, non-flat conditioning signals.

    Authors: We appreciate this suggestion for isolating the role of typological content. Our existing ablations demonstrate complementary gains by ablating each of the three conditioning strategies in turn, with the full model outperforming partial variants on CVSS-C and in low-resource settings. However, we agree that a matched non-typological control (e.g., random hierarchical vectors) would more directly test whether gains arise from typological structure rather than conditioning richness alone. In the revised version we will add this control experiment using random hierarchical embeddings of identical dimensionality and report the results alongside the existing ablations. revision: yes

  2. Referee: Experiments on CVSS-C: the claim of superior average performance across BLEU, COMET, ASR-BLEU, and BLASER 2.0 is presented without statistical significance tests, error bars, or exhaustive specification of baseline hyper-parameters and data splits. These omissions reduce confidence that the observed margins reliably exceed those attributable to implementation variance.

    Authors: We acknowledge that additional statistical rigor and experimental detail would increase confidence in the reported improvements. In the revision we will (i) report error bars from multiple random seeds, (ii) include statistical significance tests (paired bootstrap or t-tests) for key comparisons against baselines, and (iii) provide exhaustive hyper-parameter settings and data-split specifications for all baselines and our model in a new appendix section. revision: yes
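
The paired bootstrap mentioned in the response can be sketched as follows. This is a generic version that assumes per-utterance scores (for example sentence-level COMET) are available for both systems on the same test items; it is not the authors' planned procedure, and for corpus-level BLEU the statistic would have to be recomputed on each resample rather than summed. The function name and defaults are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Return the fraction of resamples in which system A outscores system B.

    Both lists must hold scores for the same test items in the same order,
    so that each resample draws matching items for the two systems.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equally long score lists")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples


if __name__ == "__main__":
    # Placeholder scores for illustration only, not results from the paper.
    system_a = [0.90, 0.80, 0.85, 0.95]
    system_b = [0.50, 0.40, 0.45, 0.55]
    print(paired_bootstrap(system_a, system_b))
```

A win rate close to 1.0 suggests the margin is unlikely to be a sampling artifact of the test set, while values near 0.5 suggest the two systems are not reliably separated.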

Circularity Check

0 steps flagged

No significant circularity; gains shown via external benchmarks and ablations rather than internal fits or self-citations

Full rationale

The paper's central claims rest on empirical evaluation of S2ST-Omni 2 on the CVSS-C benchmark using BLEU, COMET, ASR-BLEU, and BLASER 2.0, plus controlled data-budget experiments on a ~3-hour Japanese-English split. The three conditioning strategies (hierarchical encoding, gated Dual-CTC, LLM prompting) are presented as architectural choices informed by external typological databases (e.g., WALS-style features), with ablations demonstrating complementarity. No equations or derivations reduce a reported prediction to a fitted parameter by construction, and no load-bearing self-citations or uniqueness theorems are invoked. Performance numbers derive from held-out test sets rather than being forced by the model's own training objectives or normalizations, satisfying the criterion for a self-contained empirical result against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that typological databases encode transferable linguistic structure for speech translation tasks; the model itself contains the usual neural-network free parameters whose values are learned from data.

free parameters (1)
  • typological feature selection and weighting
    Choice of which typological attributes to include and how to embed them hierarchically is a modeling decision that affects the conditioning signal.
axioms (1)
  • domain assumption: Typological features from linguistic resources accurately capture cross-lingual similarities relevant to acoustic and semantic translation.
    Invoked when constructing the hierarchical language encoding and the typology-aware prompts.
