From Flat Language Labels to Typological Priors: Structured Language Conditioning for Multilingual Speech-to-Speech Translation
Pith reviewed 2026-05-20 19:21 UTC · model grok-4.3
The pith
Replacing flat language labels with structured typological priors improves multilingual speech-to-speech translation performance and data efficiency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
S2ST-Omni 2 systematically replaces flat language labels with structured typological priors at three levels: typology-informed hierarchical language encoding, dynamically-gated language-aware Dual-CTC, and typology-aware LLM prompting. On CVSS-C, under the adopted evaluation protocol, it achieves superior average performance among representative S2ST approaches across BLEU, COMET, ASR-BLEU, and BLASER 2.0. Ablation studies indicate complementary benefits from the three strategies, and controlled analyses indicate improved data efficiency in low-resource many-to-one settings.
What carries the argument
Typology-informed hierarchical language encoding, dynamically-gated language-aware Dual-CTC, and typology-aware LLM prompting as structured replacements for flat language embeddings in a many-to-one compositional S2ST framework.
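To make the contrast with a flat label concrete, here is a minimal sketch of what a hierarchical, typology-informed language code could look like. It follows the four channels the paper names for TI-HLE (morphology-related, reordering, genealogical-family, and residual language-specific), but everything else is an assumption for illustration: the `EmbeddingTable` stand-in, the channel size `DIM`, the hand-written descriptors in `LANGS`, and plain concatenation as the combination rule are not taken from the paper.

```python
import random

DIM = 4  # per-channel embedding size, arbitrary for this sketch

# Coarse typological descriptors, hand-written for illustration only.
LANGS = {
    "fr": {"morphology": "fusional", "order": "SVO", "family": "Indo-European"},
    "es": {"morphology": "fusional", "order": "SVO", "family": "Indo-European"},
    "ja": {"morphology": "agglutinative", "order": "SOV", "family": "Japonic"},
    "tr": {"morphology": "agglutinative", "order": "SOV", "family": "Turkic"},
}


class EmbeddingTable:
    """Stand-in for a learned embedding table: one fixed random vector per key."""

    def __init__(self, dim, seed):
        self.dim = dim
        self.rng = random.Random(seed)
        self.table = {}

    def __call__(self, key):
        if key not in self.table:
            self.table[key] = [self.rng.gauss(0.0, 1.0) for _ in range(self.dim)]
        return self.table[key]


morphology_emb = EmbeddingTable(DIM, seed=0)
order_emb = EmbeddingTable(DIM, seed=1)
family_emb = EmbeddingTable(DIM, seed=2)
residual_emb = EmbeddingTable(DIM, seed=3)  # keyed by language, like a flat label


def encode_language(lang):
    """Concatenate shared typological channels with a language-specific residual."""
    feats = LANGS[lang]
    return (
        morphology_emb(feats["morphology"])
        + order_emb(feats["order"])
        + family_emb(feats["family"])
        + residual_emb(lang)
    )
```

In this toy setup Japanese and Turkish share the morphology and word-order channels while differing in the family and residual channels, so parameters in the shared channels receive updates from both languages. A flat label corresponds to keeping only the residual channel.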
If this is right
- The representation-level, acoustic-level, and decoding-level strategies provide complementary benefits that together raise average scores across BLEU, COMET, ASR-BLEU, and BLASER 2.0.
- Explicit typological priors yield measurable gains in data efficiency, as shown by competitive results in a Japanese-to-English setup with approximately 3 hours of supervised data.
- Many-to-one compositional S2ST benefits when source-language information is encoded through systematic linguistic structure rather than isolated labels.
- Ablation results indicate that removing any one of the three typology-based components reduces overall performance.
Where Pith is reading between the lines
- The same hierarchical-plus-gated conditioning pattern could be tested on other speech tasks such as multilingual recognition or voice conversion to check for similar efficiency gains.
- Typological priors might reduce the amount of new paired data needed when adding languages that share structural traits with existing ones.
- The approach highlights a route for incorporating external linguistic resources directly into neural conditioning without requiring full retraining for each new language pair.
Load-bearing premise
Typological features drawn from external linguistic databases supply reliable inductive biases that generalize to the acoustic and semantic demands of speech-to-speech translation rather than merely correlating with surface-level language identity.
What would settle it
An ablation that replaces the typological priors with random or language-identity-preserving but structurally shuffled features and measures whether the reported gains on CVSS-C and the 3-hour Japanese-to-English setup disappear.
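One way to build the structurally shuffled control is sketched below. It is an illustration under stated assumptions rather than a protocol from the paper: the feature tuples in `FEATURES` are hand-written, and `shuffled_control` is a hypothetical helper. Each language receives the typological vector of a different language, so the conditioning signal stays as rich as before and still distinguishes languages (provided the vectors are unique), while the link between a language and its actual structure is broken.

```python
import random

# Hand-written illustrative descriptors: (morphology, word order, family).
FEATURES = {
    "fr": ("fusional", "SVO", "Indo-European"),
    "ja": ("agglutinative", "SOV", "Japonic"),
    "tr": ("agglutinative", "SOV", "Turkic"),
    "zh": ("isolating", "SVO", "Sino-Tibetan"),
}


def shuffled_control(features, seed=0):
    """Reassign feature vectors so that no language keeps its own vector."""
    if len(features) < 2:
        raise ValueError("need at least two languages to shuffle")
    order = sorted(features)
    random.Random(seed).shuffle(order)
    # Rotating the shuffled order by one guarantees every donor differs
    # from the recipient language.
    donors = order[1:] + order[:1]
    return {lang: features[donor] for lang, donor in zip(order, donors)}


control = shuffled_control(FEATURES, seed=0)
```

If the reported gains persist when the model is trained with `control` in place of `FEATURES`, they are more plausibly due to conditioning richness than to typological content.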
Original abstract
Compositional speech-to-speech translation (S2ST) systems built upon speech large language models (SpeechLLMs) have recently shown promising performance. However, existing S2ST systems often either neglect source-language information or encode it through a language-as-label paradigm, representing each source language as an independent flat embedding. Such a design overlooks systematic linguistic structure shared across languages, which may limit data-efficient multilingual adaptation when supervised S2ST data are scarce. To address this issue, we propose S2ST-Omni 2, a many-to-one compositional S2ST framework that systematically reformulates multilingual language conditioning from flat language labels to structured typological priors. Specifically, S2ST-Omni 2 revisits language conditioning at three levels: typology-informed hierarchical language encoding for structured source-language representation, dynamically-gated language-aware Dual-CTC for content-adaptive acoustic modulation, and typology-aware LLM prompting for decoder-side linguistic guidance. Experiments on CVSS-C show that S2ST-Omni 2 achieves superior average performance among representative S2ST approaches across BLEU, COMET, ASR-BLEU, and BLASER 2.0 under the adopted evaluation protocol. Ablation studies indicate that the proposed representation-level, acoustic-level, and decoding-level strategies provide complementary benefits. Moreover, controlled data-budget analyses and a Japanese-to-English evaluation using only approximately 3 hours of supervised training data suggest that explicit typological priors provide useful inductive biases for data-efficient multilingual S2ST.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes S2ST-Omni 2, a many-to-one compositional speech-to-speech translation framework that replaces flat language labels with structured typological priors drawn from external linguistic databases. The method applies these priors at three levels: typology-informed hierarchical language encoding for source representation, dynamically-gated language-aware Dual-CTC for content-adaptive acoustic modulation, and typology-aware LLM prompting for decoder guidance. On the CVSS-C benchmark, S2ST-Omni 2 reports superior average performance across BLEU, COMET, ASR-BLEU, and BLASER 2.0 relative to representative S2ST baselines. Ablation studies indicate complementary benefits from the three conditioning strategies, while controlled data-budget experiments and a Japanese-to-English setup using approximately 3 hours of supervised data suggest gains in data efficiency.
Significance. If the central attribution to typological inductive biases holds, the work would offer a practical advance for data-efficient multilingual S2ST by leveraging cross-lingual linguistic structure rather than language-as-label embeddings. The multi-level integration and explicit use of external typological resources constitute a clear methodological contribution, and the low-resource Japanese-to-English result provides a concrete demonstration of potential utility. The paper already supplies ablation evidence of complementarity and benchmark comparisons; strengthening the isolation of typology from general conditioning richness would elevate its impact.
major comments (2)
- [Ablation studies] The reported complementarity among representation-level, acoustic-level, and decoding-level strategies does not include a control condition that supplies structured language embeddings of matched dimensionality and richness but without typological content (e.g., random or language-identity-only hierarchical vectors). Without this control, it remains unclear whether performance and data-efficiency gains derive from typological inductive biases or simply from richer, non-flat conditioning signals.
- [Experiments on CVSS-C] The claim of superior average performance across BLEU, COMET, ASR-BLEU, and BLASER 2.0 is presented without statistical significance tests, error bars, or exhaustive specification of baseline hyper-parameters and data splits. These omissions reduce confidence that the observed margins reliably exceed those attributable to implementation variance.
minor comments (2)
- [Abstract] The abstract and method sections would benefit from an explicit statement of which typological database (e.g., WALS) and feature subset are used, together with a brief justification for feature selection and weighting.
- Notation for the hierarchical encoder and gated Dual-CTC modules could be clarified by adding one or two equations that define the gating mechanism and the typology-informed embedding construction.
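As an illustration of the kind of definition this comment asks for, one possible FiLM-style form is sketched below. It is not the paper's formulation. All symbols are assumptions introduced here: $h_t$ is an acoustic frame representation, $e_\ell$ is the source-language code built from four channel embeddings, $\gamma$ and $\beta$ are learned projections, and $g_t$ is a per-frame gate.

```latex
\begin{align}
  e_\ell &= \bigl[\, e^{\mathrm{morph}}_\ell \,;\, e^{\mathrm{order}}_\ell \,;\, e^{\mathrm{fam}}_\ell \,;\, e^{\mathrm{res}}_\ell \,\bigr] \\
  g_t &= \sigma\bigl( W_g \, [\, h_t \,;\, e_\ell \,] + b_g \bigr) \\
  \tilde{h}_t &= h_t + g_t \odot \bigl( \gamma(e_\ell) \odot h_t + \beta(e_\ell) \bigr)
\end{align}
```

Because $g_t$ depends on $h_t$ as well as $e_\ell$, the amount of language-specific modulation can vary from frame to frame, which is one reading of "content-adaptive". Setting $g_t = 0$ recovers the unconditioned representation.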
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.
Point-by-point responses
- Referee: Ablation studies: the reported complementarity among representation-level, acoustic-level, and decoding-level strategies does not include a control condition that supplies structured language embeddings of matched dimensionality and richness but without typological content (e.g., random or language-identity-only hierarchical vectors). Without this control, it remains unclear whether performance and data-efficiency gains derive from typological inductive biases or simply from richer, non-flat conditioning signals.
  Authors: We appreciate this suggestion for isolating the role of typological content. Our existing ablations demonstrate complementary gains by ablating each of the three conditioning strategies in turn, with the full model outperforming partial variants on CVSS-C and in low-resource settings. However, we agree that a matched non-typological control (e.g., random hierarchical vectors) would more directly test whether gains arise from typological structure rather than conditioning richness alone. In the revised version we will add this control experiment using random hierarchical embeddings of identical dimensionality and report the results alongside the existing ablations.
  Revision: yes
- Referee: Experiments on CVSS-C: the claim of superior average performance across BLEU, COMET, ASR-BLEU, and BLASER 2.0 is presented without statistical significance tests, error bars, or exhaustive specification of baseline hyper-parameters and data splits. These omissions reduce confidence that the observed margins reliably exceed those attributable to implementation variance.
  Authors: We acknowledge that additional statistical rigor and experimental detail would increase confidence in the reported improvements. In the revision we will (i) report error bars from multiple random seeds, (ii) include statistical significance tests (paired bootstrap or t-tests) for key comparisons against baselines, and (iii) provide exhaustive hyper-parameter settings and data-split specifications for all baselines and our model in a new appendix section.
  Revision: yes
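The paired bootstrap mentioned in the response can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: `paired_bootstrap` is a hypothetical helper, and it assumes per-utterance scores that are averaged (as with a segment-level metric such as COMET). For a corpus-level metric such as BLEU, the statistic would instead be recomputed on each resampled test set.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system A outscores system B.

    scores_a[i] and scores_b[i] must be scores for the same test utterance.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and aligned")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw n utterance indices with replacement and reuse them for both
        # systems, which is what makes the comparison paired.
        idx = [rng.randrange(n) for _ in range(n)]
        diff = sum(scores_a[i] - scores_b[i] for i in idx)
        if diff > 0:
            wins += 1
    return wins / n_resamples
```

A win rate close to 1.0 suggests the margin of A over B is stable under resampling of the test set, and `1 - win_rate` is commonly read as an approximate one-sided p-value. Seed-to-seed training variance is a separate source of noise and needs the multi-seed error bars listed under (i).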
Circularity Check
No significant circularity; gains shown via external benchmarks and ablations rather than internal fits or self-citations
The paper's central claims rest on empirical evaluation of S2ST-Omni 2 on the CVSS-C benchmark using BLEU, COMET, ASR-BLEU, and BLASER 2.0, plus controlled data-budget experiments on a ~3-hour Japanese-English split. The three conditioning strategies (hierarchical encoding, gated Dual-CTC, LLM prompting) are presented as architectural choices informed by external typological databases (e.g., WALS-style features), with ablations demonstrating complementarity. No equations or derivations reduce a reported prediction to a fitted parameter by construction, and no load-bearing self-citations or uniqueness theorems are invoked. Performance numbers derive from held-out test sets rather than being forced by the model's own training objectives or normalizations, satisfying the criterion for a self-contained empirical result against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- typological feature selection and weighting
axioms (1)
- domain assumption: Typological features from linguistic resources accurately capture cross-lingual similarities relevant to acoustic and semantic translation.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "typology-informed hierarchical language encoding (TI-HLE) decomposes source-language information into morphology-related, reordering, genealogical-family, and residual language-specific channels"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.