Multitask Learning for Grapheme-to-Phoneme Conversion of Anglicisms in German Speech Recognition

Christoph Schmidt; Dietlind Z\"uhlke; Julia Pritzen; Michael Gref

arxiv: 2105.12708 · v3 · submitted 2021-05-26 · 💻 cs.CL · cs.SD· eess.AS

Multitask Learning for Grapheme-to-Phoneme Conversion of Anglicisms in German Speech Recognition

Julia Pritzen , Michael Gref , Dietlind Z\"uhlke , Christoph Schmidt This is my paper

Pith reviewed 2026-05-24 13:09 UTC · model grok-4.3

classification 💻 cs.CL cs.SDeess.AS

keywords multitask learninggrapheme-to-phoneme conversionAnglicismsGerman speech recognitionsequence-to-sequence modelspronunciation dictionaryword error rate

0 comments

The pith

A multitask sequence-to-sequence model with an added classifier for Anglicisms produces more accurate phoneme sequences for English loanwords in German and lowers error rates when the outputs are used in speech recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a grapheme-to-phoneme conversion system that performs two tasks at once: converting spelling to sound and deciding whether each word is an Anglicism or a native German word. The classification decision conditions the phoneme generation so the model applies different mapping rules to the two classes. When the resulting dictionaries are merged into an existing German automatic speech recognition system, performance improves on a held-out set of Anglicisms. The measured gains are a 1 percent drop in overall word error rate and a 3 percent drop in the Anglicism-specific error rate. The approach is presented as a way to handle the irregular pronunciations that standard dictionary-generation tools produce for loanwords.

Core claim

Extending a grapheme-to-phoneme model with a joint classifier that distinguishes Anglicisms from native German words enables the model to generate pronunciations differently depending on the classification result; the resulting supplementary dictionaries, when added to a baseline German speech recognizer, reduce word error rate by 1 percent and Anglicism error rate by 3 percent on a dedicated evaluation set.

What carries the argument

A multitask sequence-to-sequence architecture that jointly performs grapheme-to-phoneme conversion and binary classification of input words as Anglicisms versus native German words.

If this is right

Supplementary pronunciation dictionaries generated by the model can be merged directly into existing German ASR systems.
The same multitask structure can be used to produce pronunciation lexicons for other classes of words that deviate from standard German phonotactics.
The joint training objective encourages the encoder to extract features useful for both classification and phoneme prediction.
Improvements are measured on a dedicated Anglicism evaluation set rather than only on general German test data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be extended to other languages that import many words with foreign phonology, such as French or Japanese loanword handling.
If the classifier output were made available at inference time, downstream systems could route words to different pronunciation models without retraining the entire recognizer.
The approach assumes that the binary distinction is sufficient; finer-grained labels for word origin might yield further gains but are not tested here.

Load-bearing premise

The auxiliary classifier must separate Anglicisms from native words reliably enough that conditioning phoneme generation on the binary label improves accuracy on loanwords without degrading performance on native German words.

What would settle it

An experiment in which the classifier is replaced by random labels or by a low-accuracy classifier and the joint model then shows no reduction in Anglicism error rate on the dedicated test set.

Figures

Figures reproduced from arXiv: 2105.12708 by Christoph Schmidt, Dietlind Z\"uhlke, Julia Pritzen, Michael Gref.

read the original abstract

Anglicisms are a challenge in German speech recognition. Due to their irregular pronunciation compared to native German words, automatically generated pronunciation dictionaries often include faulty phoneme sequences for Anglicisms. In this work, we propose a multitask sequence-to-sequence approach for grapheme-to-phoneme conversion to improve the phonetization of Anglicisms. We extended a grapheme-to-phoneme model with a classifier to distinguish Anglicisms from native German words. With this approach, the model learns to generate pronunciations differently depending on the classification result. We used our model to create supplementary Anglicism pronunciation dictionaries that are added to an existing German speech recognition model. Tested on a dedicated Anglicism evaluation set, we improved the recognition of Anglicisms compared to a baseline model, reducing the word error rate by 1 % and the Anglicism error rate by 3 %. We show that multitask learning can help solving the challenge of Anglicisms in German speech recognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes a multitask seq2seq G2P model for German that adds an auxiliary classifier to distinguish Anglicisms from native words, allowing pronunciation generation to be conditioned on the classification output. The resulting model is used to generate supplementary pronunciation dictionaries that are added to an existing ASR system; on a dedicated Anglicism evaluation set this yields a 1% absolute WER reduction and a 3% reduction in Anglicism error rate relative to a baseline.

Significance. If the central claim holds, the work supplies a concrete, deployable technique for improving loanword handling in German ASR via multitask learning. The empirical gains are modest but directly measured on the target error type; the approach is simple enough to be reproducible and could be adopted in production dictionaries.

major comments (1)

[Abstract / Results] Abstract and evaluation description: the reported 1% WER and 3% Anglicism-error reductions are measured only on a dedicated Anglicism test set. No numbers are supplied for overall WER on a standard German test set, for native-word subsets, or for any ablation that isolates the classifier branch. Because the central claim is that the multitask construction improves Anglicism phonetization “without side effects,” the absence of these measurements is load-bearing; degradation on native words would negate the practical value of the supplementary dictionaries.

minor comments (2)

[Abstract] The abstract does not state the size of the Anglicism evaluation set, the baseline G2P or ASR model details, or whether any statistical significance test was performed on the 1% / 3% deltas.
Notation for the joint loss or the way the classifier output is injected into the decoder is not described in the provided abstract; a short methods paragraph would clarify the architecture.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback. The concern about missing baseline measurements on standard test sets is valid and directly addresses the practical utility of the proposed supplementary dictionaries. We address this point below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract / Results] Abstract and evaluation description: the reported 1% WER and 3% Anglicism-error reductions are measured only on a dedicated Anglicism test set. No numbers are supplied for overall WER on a standard German test set, for native-word subsets, or for any ablation that isolates the classifier branch. Because the central claim is that the multitask construction improves Anglicism phonetization “without side effects,” the absence of these measurements is load-bearing; degradation on native words would negate the practical value of the supplementary dictionaries.

Authors: We agree that the absence of these measurements weakens the claim of no side effects. The supplementary dictionaries contain only Anglicism entries generated by the multitask model, so native-word entries in the baseline lexicon remain unchanged; however, this does not fully address whether the underlying multitask G2P model itself would produce different (potentially worse) pronunciations for native words if applied to them. In the revision we will add (i) overall WER on a standard German test set (e.g., Common Voice German), (ii) separate native-word and Anglicism subsets on that set, and (iii) an ablation comparing the full multitask model against a single-task G2P baseline on both subsets. These additions will either confirm the absence of degradation or allow us to qualify the claims. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical multitask model evaluated on external test set

full rationale

The paper proposes and trains a multitask seq2seq G2P model with an auxiliary classifier, generates pronunciation dictionaries, and measures WER/Anglicism error reductions on a dedicated held-out Anglicism evaluation set against a baseline. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described approach. All reported gains are direct empirical measurements on external data, with no load-bearing steps that reduce to the inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, axioms, or invented entities are described in the abstract; the work is an applied empirical experiment in machine learning.

pith-pipeline@v0.9.0 · 5711 in / 1109 out tokens · 27711 ms · 2026-05-24T13:09:17.311039+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

[1]

Ardila, R., Branson, M., Davis, K., Kohler, M., Meyer, J., Henretty, M., Morais, R., Saunders, L., Tyers, F., and Weber, G. (2020). Common voice: A massively-multilingual speech corpus. In Proceedings of the 12th Language Resources and Evaluation Conference , pages 4218--4222, Marseille, France, May. European Language Resources Association

work page 2020
[2]

Bavarian Archive for Speech Signals . (2013). Pronunciation Lexicon PHONOLEX . https://www.phonetik.uni-muenchen.de/forschung/Bas/BasPHONOLEXeng.html

work page 2013
[3]

Elfers, A. (2020). Der Anglizismen-Index 2020 - Deutsch statt Denglisch . https://vds-ev.de/denglisch-und-anglizismen/anglizismenindex/ag-anglizismenindex/

work page 2020
[4]

Stadtschnitzer, M., Schwenninger, J., Stein, D., and Koehler, J. (2014). Exploiting the large-scale G erman broadcast corpus to boost the F raunhofer IAIS S peech R ecognition S ystem. In Proceedings of the Ninth International Conference on Language Resources and Evaluation ( LREC '14) , pages 3887--3890, Reykjavik, Iceland, May. European Language Resourc...

work page 2014
[5]

Wiktionary. (2019). Verzeichnis:Deutsch/Anglizismen . https://de.wiktionary.org/wiki/Verzeichnis:Deutsch/Anglizismen

work page 2019
[6]

Wiktionary. (2020). Verzeichnis:Deutsch/Anglizismen/Schein\-angli\-zismen . https://de.wiktionary.org/wiki/Verzeichnis:Deutsch/Anglizismen/Scheinanglizismen

work page 2020
[7]

write newline

" write newline "" before.all 'output.state := FUNCTION fin.entry add.period write newline FUNCTION new.block output.state before.all = 'skip after.block 'output.state := if FUNCTION new.sentence output.state after.block = 'skip output.state before.all = 'skip after.sentence 'output.state := if if FUNCTION not #0 #1 if FUNCTION and 'skip pop #0 if FUNCTIO...

work page
[8]

Baevski, A., Zhou, H., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

work page 2020
[9]

and Ney, H

Bisani, M. and Ney, H. (2008). Joint-sequence models for grapheme-to-phoneme conversion . Speech Communication , 50(5):434--451

work page 2008
[10]

Burmasova, S. (2010). Empirische Untersuchung der Anglizismen im Deutschen am Material der Zeitung Die WELT (Jahrgänge 1994 und 2004)

work page 2010
[11]

Caruana, R. (1993). Multitask Learning: A Knowledge-Based Source of Inductive Bias . In Proceedings of the Tenth International Conference on Machine Learning , pages 41--48. Morgan Kaufmann

work page 1993
[12]

Caruana, R. (1997). Multitask Learning . Ph.D. thesis, Carnegie Mellon University

work page 1997
[13]

Coats, S. (2019). Lexicon geupdated: New German anglicisms in a social media corpus . European Journal of Applied Linguistics , 7:255 -- 280

work page 2019
[14]

Conneau, A., Baevski, A., Collobert, R., Mohamed, A., and Auli, M. (2020). Unsupervised Cross-lingual Representation Learning for Speech Recognition

work page 2020
[15]

Gref, M., Schmidt, C., Behnke, S., and K \" o hler, J. (2019). Two-Staged Acoustic Modeling Adaption for Robust Speech Recognition by the Example of German Oral History Interviews . In IEEE International Conference on Multimedia and Expo, ICME 2019, Shanghai, China, July 8-12, 2019 , pages 796--801. IEEE

work page 2019
[16]

Grosman, J. (2021). Wav2Vec2-Large-XLSR-53-German . https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-german

work page 2021
[17]

and Schmidhuber, J

Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory . Neural Computation , 9(8):1735--1780, November

work page 1997
[18]

Hunt, J. W. (2019). Anglicisms in German: tsunami or trickle? In Amei Koll-Stobbe, editor, Informalization and Hybridization of Speech Practices: Polylingual Meaning-Making across Domains, Genres, and Media , pages 25--58. Peter Lang, Bern, Schweiz

work page 2019
[19]

Max Planck Institute for Psycholinguistics, The Language Archive . (2020). ELAN (Version 5.9) and Simple-ELAN (Version 1.3) [Computer software] . https://archive.mpi.nl/tla/elan

work page 2020
[20]

Milde, B., Schmidt, C., and Köhler, J. (2017). Multitask Sequence-to-Sequence Models for Grapheme-to-Phoneme Conversion . In INTERSPEECH 2017 , pages 2536--2540

work page 2017
[21]

Sokolov, A., Rohlin, T., and Rastrow, A. (2019). Neural Machine Translation for Multilingual Grapheme-to-Phoneme Conversion . In Interspeech 2019 , pages 2065--2069

work page 2019
[22]

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks . In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 , NIPS'14, pages 3104--–3112, Cambridge, MA, USA. MIT Press

work page 2014
[23]

L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. (2020). Transformers: State-of-the-Art Natural Language Processing . In Proceedings of the 2020 ...

work page 2020
[24]

and Zweig, G

Yao, K. and Zweig, G. (2015). Sequence-to-Sequence Neural Net Models for Grapheme-to-Phoneme Conversion . In INTERSPEECH 2015 , pages 3330--3334

work page 2015

[1] [1]

Ardila, R., Branson, M., Davis, K., Kohler, M., Meyer, J., Henretty, M., Morais, R., Saunders, L., Tyers, F., and Weber, G. (2020). Common voice: A massively-multilingual speech corpus. In Proceedings of the 12th Language Resources and Evaluation Conference , pages 4218--4222, Marseille, France, May. European Language Resources Association

work page 2020

[2] [2]

Bavarian Archive for Speech Signals . (2013). Pronunciation Lexicon PHONOLEX . https://www.phonetik.uni-muenchen.de/forschung/Bas/BasPHONOLEXeng.html

work page 2013

[3] [3]

Elfers, A. (2020). Der Anglizismen-Index 2020 - Deutsch statt Denglisch . https://vds-ev.de/denglisch-und-anglizismen/anglizismenindex/ag-anglizismenindex/

work page 2020

[4] [4]

Stadtschnitzer, M., Schwenninger, J., Stein, D., and Koehler, J. (2014). Exploiting the large-scale G erman broadcast corpus to boost the F raunhofer IAIS S peech R ecognition S ystem. In Proceedings of the Ninth International Conference on Language Resources and Evaluation ( LREC '14) , pages 3887--3890, Reykjavik, Iceland, May. European Language Resourc...

work page 2014

[5] [5]

Wiktionary. (2019). Verzeichnis:Deutsch/Anglizismen . https://de.wiktionary.org/wiki/Verzeichnis:Deutsch/Anglizismen

work page 2019

[6] [6]

Wiktionary. (2020). Verzeichnis:Deutsch/Anglizismen/Schein\-angli\-zismen . https://de.wiktionary.org/wiki/Verzeichnis:Deutsch/Anglizismen/Scheinanglizismen

work page 2020

[7] [7]

write newline

" write newline "" before.all 'output.state := FUNCTION fin.entry add.period write newline FUNCTION new.block output.state before.all = 'skip after.block 'output.state := if FUNCTION new.sentence output.state after.block = 'skip output.state before.all = 'skip after.sentence 'output.state := if if FUNCTION not #0 #1 if FUNCTION and 'skip pop #0 if FUNCTIO...

work page

[8] [8]

Baevski, A., Zhou, H., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

work page 2020

[9] [9]

and Ney, H

Bisani, M. and Ney, H. (2008). Joint-sequence models for grapheme-to-phoneme conversion . Speech Communication , 50(5):434--451

work page 2008

[10] [10]

Burmasova, S. (2010). Empirische Untersuchung der Anglizismen im Deutschen am Material der Zeitung Die WELT (Jahrgänge 1994 und 2004)

work page 2010

[11] [11]

Caruana, R. (1993). Multitask Learning: A Knowledge-Based Source of Inductive Bias . In Proceedings of the Tenth International Conference on Machine Learning , pages 41--48. Morgan Kaufmann

work page 1993

[12] [12]

Caruana, R. (1997). Multitask Learning . Ph.D. thesis, Carnegie Mellon University

work page 1997

[13] [13]

Coats, S. (2019). Lexicon geupdated: New German anglicisms in a social media corpus . European Journal of Applied Linguistics , 7:255 -- 280

work page 2019

[14] [14]

Conneau, A., Baevski, A., Collobert, R., Mohamed, A., and Auli, M. (2020). Unsupervised Cross-lingual Representation Learning for Speech Recognition

work page 2020

[15] [15]

Gref, M., Schmidt, C., Behnke, S., and K \" o hler, J. (2019). Two-Staged Acoustic Modeling Adaption for Robust Speech Recognition by the Example of German Oral History Interviews . In IEEE International Conference on Multimedia and Expo, ICME 2019, Shanghai, China, July 8-12, 2019 , pages 796--801. IEEE

work page 2019

[16] [16]

Grosman, J. (2021). Wav2Vec2-Large-XLSR-53-German . https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-german

work page 2021

[17] [17]

and Schmidhuber, J

Hochreiter, S. and Schmidhuber, J. (1997). Long Short-Term Memory . Neural Computation , 9(8):1735--1780, November

work page 1997

[18] [18]

Hunt, J. W. (2019). Anglicisms in German: tsunami or trickle? In Amei Koll-Stobbe, editor, Informalization and Hybridization of Speech Practices: Polylingual Meaning-Making across Domains, Genres, and Media , pages 25--58. Peter Lang, Bern, Schweiz

work page 2019

[19] [19]

Max Planck Institute for Psycholinguistics, The Language Archive . (2020). ELAN (Version 5.9) and Simple-ELAN (Version 1.3) [Computer software] . https://archive.mpi.nl/tla/elan

work page 2020

[20] [20]

Milde, B., Schmidt, C., and Köhler, J. (2017). Multitask Sequence-to-Sequence Models for Grapheme-to-Phoneme Conversion . In INTERSPEECH 2017 , pages 2536--2540

work page 2017

[21] [21]

Sokolov, A., Rohlin, T., and Rastrow, A. (2019). Neural Machine Translation for Multilingual Grapheme-to-Phoneme Conversion . In Interspeech 2019 , pages 2065--2069

work page 2019

[22] [22]

Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to Sequence Learning with Neural Networks . In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 , NIPS'14, pages 3104--–3112, Cambridge, MA, USA. MIT Press

work page 2014

[23] [23]

L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. (2020). Transformers: State-of-the-Art Natural Language Processing . In Proceedings of the 2020 ...

work page 2020

[24] [24]

and Zweig, G

Yao, K. and Zweig, G. (2015). Sequence-to-Sequence Neural Net Models for Grapheme-to-Phoneme Conversion . In INTERSPEECH 2015 , pages 3330--3334

work page 2015