Polyphone Disambiguation for Mandarin Chinese Using Conditional Neural Network with Multi-level Embedding Features

Chuxiong Zhang; Ming Li; Xiaoyi Qin; Yaogen Yang; Zexin Cai

arXiv: 1907.01749 · v1 · pith:TPLO263Inew · submitted 2019-07-03 · 💻 cs.CL · eess.AS · stat.ML

Zexin Cai, Yaogen Yang, Chuxiong Zhang, Xiaoyi Qin, Ming Li

Pith reviewed 2026-05-25 10:34 UTC · model grok-4.3

classification 💻 cs.CL · eess.AS · stat.ML

keywords polyphone disambiguation · Mandarin Chinese · conditional neural network · text-to-speech front-end · word embeddings · bidirectional RNN · homograph resolution · pronunciation prediction

The pith

A conditional neural network using bidirectional RNN sentence encoding and word embeddings disambiguates Mandarin polyphones at 94.69% accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a neural system to resolve which pronunciation a polyphonic character should take in Mandarin text, a necessary step before generating speech from written input. A bidirectional recurrent network first builds a representation of the surrounding sentence, after which a prediction network combines the character embedding with additional conditions drawn from a pre-trained word embedding table. The full model reaches 94.69 percent accuracy on a public polyphonic-character test set. Separate runs that supply only sentence-level or only word-level conditions both perform well, indicating that either source of context can support reliable disambiguation. The work focuses on removing homograph errors that arise in the front-end stage of Mandarin text-to-speech pipelines.
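To make the task concrete, the following minimal sketch treats the character-to-pronunciation mapping as a lookup table and finds the positions a disambiguation model would have to resolve. The lexicon is a hand-picked illustrative subset rather than the dataset's inventory, and the names `LEXICON`, `candidates`, and `polyphone_positions` are hypothetical, not taken from the paper.

```python
# Illustrative lexicon: character -> candidate pinyin (tone as trailing digit).
# A hand-picked subset for demonstration only, not the dataset's inventory.
LEXICON = {
    "背": ("bei1", "bei4"),
    "好": ("hao3", "hao4"),
    "我": ("wo3",),
    "包": ("bao1",),
}


def candidates(char):
    """Return the candidate pronunciations of a character (empty if unknown)."""
    return LEXICON.get(char, ())


def polyphone_positions(sentence):
    """Indices of characters with more than one candidate pronunciation."""
    return [i for i, ch in enumerate(sentence) if len(candidates(ch)) > 1]


sentence = "我背包"
positions = polyphone_positions(sentence)
```

A disambiguation model of the kind described here would then pick one candidate for each returned position, using the surrounding context.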

Core claim

A conditional neural network, composed of a bidirectional recurrent neural network sentence encoder followed by a prediction network that receives the polyphonic character embedding together with multi-level conditional features, predicts the pronunciation of polyphonic characters. When the word-level condition is taken from a pre-trained word-to-vector table, the system attains 94.69 percent accuracy on a public dataset, and controlled experiments indicate that both sentence-level and word-level conditions independently support strong performance for Mandarin polyphone disambiguation.

What carries the argument

Conditional neural network with bidirectional RNN sentence encoder plus word-to-vector lookup table supplying conditional features
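As a rough illustration of the prediction step, the sketch below concatenates a character embedding with sentence-level and word-level condition vectors and passes the result through three fully connected layers, with ReLU on the first two. The concatenation order, the softmax output, the toy layer sizes, and the random weights are all assumptions made for illustration. In the described system the sentence vector would come from the bidirectional RNN encoder and the word vector from the pre-trained lookup table.

```python
import math
import random

random.seed(0)


def linear(x, weights, bias):
    """Affine map: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out):
    """Random weights and zero bias; a stand-in for trained parameters."""
    weights = [[random.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


# Toy sizes (assumptions); the real model is far larger.
CHAR_DIM, SENT_DIM, WORD_DIM = 4, 6, 5
HIDDEN1, HIDDEN2, NUM_PINYIN = 8, 8, 10

layers = [
    init_layer(CHAR_DIM + SENT_DIM + WORD_DIM, HIDDEN1),
    init_layer(HIDDEN1, HIDDEN2),
    init_layer(HIDDEN2, NUM_PINYIN),
]


def predict(char_emb, sent_vec, word_vec):
    """Return a probability distribution over pronunciation classes."""
    x = char_emb + sent_vec + word_vec  # list concatenation (assumed fusion)
    x = relu(linear(x, *layers[0]))
    x = relu(linear(x, *layers[1]))
    return softmax(linear(x, *layers[2]))


probs = predict([0.1] * CHAR_DIM, [0.2] * SENT_DIM, [0.3] * WORD_DIM)
best = max(range(NUM_PINYIN), key=probs.__getitem__)
```

Dropping `sent_vec` or `word_vec` from the concatenation gives the single-condition variants that the ablation compares.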

If this is right

The architecture directly targets the homograph problem that appears in the front-end processing stage of Mandarin text-to-speech systems.
Both sentence-level context from the bidirectional RNN and word-level conditions from the pre-trained table can each produce good disambiguation accuracy.
The prediction network successfully maps the combination of character embedding and conditional features onto the correct pronunciation label.
The same conditional framework can be re-used with different choices of conditioning level without retraining the entire encoder.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reported accuracy suggests the method could be inserted into existing Mandarin TTS pipelines with only modest additional latency from the extra embedding lookup.
If the word embeddings were replaced by embeddings trained on domain-specific text, accuracy on technical or conversational polyphones might rise further.
The same sentence-plus-word conditioning pattern could be tested on polyphonic characters in other tonal languages that share similar front-end pronunciation selection problems.

Load-bearing premise

The pre-trained word-to-vector lookup table supplies effective word-level conditional features that meaningfully improve pronunciation prediction over sentence context alone.

What would settle it

An ablation that removes the word-level condition and finds no statistically significant drop in accuracy on the same public dataset would falsify the claimed value of the multi-level conditioning approach.
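One way to check such an ablation for significance is an exact McNemar test on the paired predictions of the full and ablated systems over the same test sentences. This is a sketch of a standard test, not a procedure taken from the paper, and the disagreement counts below are hypothetical.

```python
from math import comb


def mcnemar_exact(only_full_correct, only_ablated_correct):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = only_full_correct + only_ablated_correct
    if n == 0:
        return 1.0  # the two systems never disagree
    k = min(only_full_correct, only_ablated_correct)
    # Under the null hypothesis, each disagreement favours either system
    # with probability 0.5, so the smaller count is Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts: sentences that exactly one of the two systems gets right.
p_value = mcnemar_exact(only_full_correct=120, only_ablated_correct=60)
significant = p_value < 0.05
```

Only the sentences on which the two systems disagree enter the test, which is why it suits two models scored on an identical test set.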

Figures

Figures reproduced from arXiv: 1907.01749 by Chuxiong Zhang, Ming Li, Xiaoyi Qin, Yaogen Yang, Zexin Cai.

**Figure 1.** The network architecture of our proposed system.

Original abstract

This paper describes a conditional neural network architecture for Mandarin Chinese polyphone disambiguation. The system is composed of a bidirectional recurrent neural network component acting as a sentence encoder to accumulate the context correlations, followed by a prediction network that maps the polyphonic character embeddings along with the conditions to corresponding pronunciations. We obtain the word-level condition from a pre-trained word-to-vector lookup table. One goal of polyphone disambiguation is to address the homograph problem existing in the front-end processing of Mandarin Chinese text-to-speech system. Our system achieves an accuracy of 94.69% on a publicly available polyphonic character dataset. To further validate our choices on the conditional feature, we investigate polyphone disambiguation systems with multi-level conditions respectively. The experimental results show that both the sentence-level and the word-level conditional embedding features are able to attain good performance for Mandarin Chinese polyphone disambiguation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

BiRNN sentence encoder plus word-embedding conditions reaches 94.69% on a public polyphone set with an ablation showing both levels help, but the result sits without baselines.

This paper applies a bidirectional RNN as sentence encoder and feeds polyphone embeddings plus conditions from a pre-trained word lookup table into a prediction head. It reports 94.69% accuracy on a public dataset and includes an ablation that tests sentence-level context alone, word-level features alone, and the combination; both levels reach solid performance and the joint version is best.

The architecture is a straightforward conditional extension of standard sequence models rather than a new framework. What is actually new is the explicit multi-level conditioning setup for this exact task. The work is grounded: it uses an external public dataset and off-the-shelf embeddings, so the accuracy number is not internally circular. The ablation supplies direct evidence that the word-level condition adds value beyond sentence context. No load-bearing assumptions about embedding quality are hidden; the results simply show the combination works on the chosen data.

A soft spot is the absence of baseline scores from prior polyphone systems in the provided summary. Without those numbers or dataset statistics, the 94.69% is hard to place against existing methods. If the full paper supplies the comparisons and error analysis, the claim strengthens; otherwise the central result remains a single point estimate.

This is for TTS practitioners who need a drop-in module for Mandarin front-end processing. A reader building or tuning a synthesis pipeline could extract the conditional design and the ablation pattern. The thinking is clear and the empirical claim is falsifiable, so the paper deserves a serious referee even if revisions are needed on the comparison side.

Referee Report

2 major / 0 minor

Summary. The manuscript describes a conditional neural network for Mandarin Chinese polyphone disambiguation consisting of a bidirectional RNN sentence encoder followed by a prediction head that incorporates polyphonic character embeddings conditioned on sentence-level context and word-level features from a pre-trained word-to-vector lookup table. The central empirical claim is an accuracy of 94.69% on a publicly available polyphonic character dataset, with additional experiments investigating multi-level conditional features and reporting that both sentence-level and word-level conditions attain good performance.

Significance. If substantiated, the work provides a concrete architecture for addressing homograph disambiguation in Mandarin TTS front-end processing. The explicit multi-level ablation on conditional features is a positive element that allows assessment of the contribution of word-level embeddings over sentence context alone. However, the absence of any baseline comparisons or prior-art results limits evaluation of whether the reported accuracy represents a meaningful advance.

major comments (2)

[Experimental Results] Experimental Results section: the central claim reports a single accuracy figure of 94.69% with no baseline comparisons, no prior methods, no error bars, and no dataset statistics (e.g., number of polyphones, train/test split sizes), rendering the performance claim difficult to interpret or verify against standard practice in the field.
[Ablation experiments] Ablation on multi-level conditions: while the text states that both sentence-level and word-level conditions reach good performance, the specific accuracy numbers, differences, and statistical significance for each condition level are not quantified, weakening the validation of the word-level embedding assumption.
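The request for error bars can be met even for a single run by attaching a binomial confidence interval to the reported accuracy. The sketch below computes a Wilson score interval. The test-set size `n_test` is a placeholder, since the test-split size is not stated in the material reproduced here.

```python
import math


def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    denom = 1 + z * z / n
    center = (accuracy + z * z / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


n_test = 10_000  # hypothetical test-set size, not taken from the paper
low, high = wilson_interval(0.9469, n_test)
```

The interval narrows roughly with the square root of the test-set size, so reporting the split sizes alongside the accuracy lets a reader judge how much weight a single figure can carry.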

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed review. We address each of the major comments below and plan to revise the manuscript to incorporate the suggested improvements for better clarity and comparability.

Referee: [Experimental Results] Experimental Results section: the central claim reports a single accuracy figure of 94.69% with no baseline comparisons, no prior methods, no error bars, and no dataset statistics (e.g., number of polyphones, train/test split sizes), rendering the performance claim difficult to interpret or verify against standard practice in the field.

Authors: We agree that providing baseline comparisons, prior methods, error bars, and dataset statistics would strengthen the paper. In the revised version, we will include these elements in the Experimental Results section. Specifically, we will report comparisons to existing approaches in the literature, include standard deviations from multiple training runs as error bars, and detail the dataset composition including the number of unique polyphones and the sizes of the training and test splits. revision: yes
Referee: [Ablation experiments] Ablation on multi-level conditions: while the text states that both sentence-level and word-level conditions reach good performance, the specific accuracy numbers, differences, and statistical significance for each condition level are not quantified, weakening the validation of the word-level embedding assumption.

Authors: We acknowledge the need for quantified results in the ablation study. We will update the manuscript to include the specific accuracy numbers for the sentence-level only, word-level only, and combined conditions, along with the performance differences and any applicable statistical significance measures to better validate the contribution of the multi-level features. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The manuscript describes an empirical neural architecture (BiRNN sentence encoder plus conditional prediction head) trained on external data and evaluated on a publicly available polyphonic character dataset. It reports 94.69% accuracy and includes an ablation over sentence-level vs. word-level conditions drawn from a pre-trained external embedding table. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All load-bearing claims rest on external benchmarks and pre-trained resources rather than internal re-use of the target result, satisfying the criteria for a self-contained empirical result.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Model depends on standard neural-network assumptions plus one domain assumption about pre-trained embeddings; no new entities or fitted constants are introduced beyond typical embedding dimensions.

free parameters (1)

embedding dimensions and RNN hidden size
Chosen during model design to fit the task; values not reported in abstract.
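A configuration record makes explicit which hyperparameters are pinned down by the text and which remain free. In the sketch below, the fully connected layer sizes and the LSTM dropout rate come from the paper excerpts quoted in the reference graph on this page. Dimensions not reported there are left as `None` rather than guessed. The class name, field names, and the `input_dim=1000` used in the parameter count are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ModelConfig:
    # Reported in the quoted paper excerpts.
    fc_sizes: Tuple[int, int, int] = (512, 1024, 285)
    lstm_dropout: float = 0.1
    # Not reported in the material reproduced here; left unset.
    char_embedding_dim: Optional[int] = None
    word_embedding_dim: Optional[int] = None
    lstm_hidden_size: Optional[int] = None


def fc_parameter_count(input_dim, fc_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    total = 0
    prev = input_dim
    for size in fc_sizes:
        total += prev * size + size
        prev = size
    return total


config = ModelConfig()
# input_dim=1000 is a hypothetical width for the concatenated input features.
count = fc_parameter_count(1000, config.fc_sizes)
```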

axioms (1)

domain assumption: Pre-trained word embeddings capture semantic information useful for pronunciation choice
Invoked when word-level condition is fed to the prediction network.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

bidirectional recurrent neural network component acting as a sentence encoder ... prediction network that maps the polyphonic character embeddings along with the conditions to corresponding pronunciations ... System CW / CC / CWC ... BLSTM encoder ... fc-layer3 ... 285

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Our system achieves an accuracy of 94.69% on a publicly available polyphonic character dataset

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 5 internal anchors

[1]

Polyphone Disambiguation for Mandarin Chinese Using Conditional Neural Network with Multi-level Embedding Features

Introduction. The grapheme-to-phoneme (G2P) conversion is a fundamental front-end procedure in the Chinese Text-to-Speech (TTS) synthesis system, either the traditional HMM-based speech synthesis system [1, 2] or the End-to-End speech synthesis system [3, 4, 5, 6]. G2P typically generates a sequence of phones from a sequence of characters or grapheme...

[2]

背” can be pronounced as either “bei1

Chinese Polyphonic Characters. Except for the monophonic characters in Mandarin Chinese, there are polyphonic characters that refer to those with more than one pronunciations. Specifically, we use a mapping function to formulate the conversion from a character to its corresponding pronunciations. Function f is defined as follows: f : C → P (1) where C den...

[3]

Specifically, the polyphone disambiguation system converts a polyphonic character to its corresponding pinyin

Method. Different from the traditional grapheme-to-phoneme (G2P) conversion, the polyphone disambiguation is considered as a classification problem. Specifically, the polyphone disambiguation system converts a polyphonic character to its corresponding pinyin. Our proposed system is shown in Figure 1. In terms of the characteristics and properties of the po...

[4]

Experimental Results. 4.1. Polyphonic Character Database. For training and evaluating our proposed polyphone disambiguation systems, we use a publicly available dataset from Beijing Data-Baker Science and Technology Ltd which contains 150 frequently used polyphonic characters and their 151585 corresponding sentences. We divide the corpus into a training...

[5]

For the prediction module, we adopt three fully connected layers with size 512, 1024 and 285 respectively

The dropout rate of LSTM is set to 0.1 to avoid overfitting [31]. For the prediction module, we adopt three fully connected layers with size 512, 1024 and 285 respectively. The output size 285 is equal to the number of all possible pinyins in this polyphonic character database. The activation function of the first two fully connected layers is RELU. We use ...

[6]

We use the polyphonic character database described in section 4.1 for training and evaluating since we do not have the personal labelled data used in [15]

which adopts two LSTM layers with size 512 and the NLPIR toolkit [32] for POS tagging on the text. We use the polyphonic character database described in section 4.1 for training and evaluating since we do not have the personal labelled data used in [15]. The approach presented in [15] had compared with other polyphone disambiguation approaches and shown...

[7]

We explore sentence-level encoding vector as a condition as well as the word-level vector obtained from a pre-trained word-to-vector lookup table

Conclusions. In this paper, we propose a data-driven approach using conditional neural network architecture for Mandarin Chinese polyphone disambiguation. We explore sentence-level encoding vector as a condition as well as the word-level vector obtained from a pre-trained word-to-vector lookup table. Results show that the sentence-level conditional...

[8]

Acknowledgments. This research was funded in part by the National Natural Science Foundation of China (61773413), Natural Science Foundation of Guangzhou City (201707010363), Six Talent Peaks project in Jiangsu Province (JY-074), Science and Technology Program of Guangzhou City (201903010040)

[9]

An HMM-Based Mandarin Chinese Text-To-Speech System

Y. Qian, F. Soong, Y. Chen, and M. Chu, “An HMM-Based Mandarin Chinese Text-To-Speech System,” in 2006 International Symposium on Chinese Spoken Language Processing (ISCSLP), 2006, pp. 223–232

[10]

The HMM-Based Speech Synthesis System (HTS) Version 2.0

H. Zen, T. Nose, J. Yamagishi, S. Sako, T. Masuko, A. W. Black, and K. Tokuda, “The HMM-Based Speech Synthesis System (HTS) Version 2.0,” in 6th ISCA Workshop on Speech Synthesis (SSW-6), 2007, pp. 294–299

[11]

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep Voice 3: 2000-Speaker Neural Text-to-Speech,” CoRR, vol. abs/1710.07654, 2017

[12]

Deep Voice: Real-time Neural Text-to-Speech

S. O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, and J. Raiman, “Deep Voice: Real-time Neural Text-to-Speech,” CoRR, vol. abs/1702.07825, 2017

[13]

Deep Voice 2: Multi-Speaker Neural Text-to-Speech

S. Arik, G. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou, “Deep Voice 2: Multi-Speaker Neural Text-to-Speech,” CoRR, vol. abs/1705.08947, 2017

[14]

Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions

J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., “Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4779–4783

[15]

Grapheme-to-Phoneme Conversion Using Long Short-Term Memory Recurrent Neural Networks

K. Rao, F. Peng, H. Sak, and F. Beaufays, “Grapheme-to-Phoneme Conversion Using Long Short-Term Memory Recurrent Neural Networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4225–4229

[16]

The Synthesis Rules in A Chinese Text-to-Speech System

L.-S. Lee, C.-Y. Tseng, and M. Ouh-Young, “The Synthesis Rules in A Chinese Text-to-Speech System,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 9, pp. 1309–1320, 1989

[17]

Grapheme-to-Phoneme Conversion in Chinese TTS System

H. Dong, J. Tao, and B. Xu, “Grapheme-to-Phoneme Conversion in Chinese TTS System,” in 2004 International Symposium on Chinese Spoken Language Processing (ISCSLP), 2004, pp. 165–168

[18]

Improved Grapheme-to-Phoneme Conversion for Mandarin TTS

L. Yi, L. Jian, H. Jie, and Z. Xiong, “Improved Grapheme-to-Phoneme Conversion for Mandarin TTS,” Tsinghua Science & Technology, vol. 14, no. 5, pp. 606–611, 2009

[19]

An Overview of Text-to-Speech Synthesis Techniques

M. Rashad, H. M. El-Bakry, I. R. Isma’il, and N. Mastorakis, “An Overview of Text-to-Speech Synthesis Techniques,” Latest trends on communications and information technology, pp. 84–89, 2010

[20]

Disambiguation of Chinese Polyphonic Characters

H. Zhang, J. Yu, W. Zhan, and S. Yu, “Disambiguation of Chinese Polyphonic Characters,” in The First International Workshop on MultiMedia Annotation (MMA2001), vol. 1, 2001, pp. 30–1

[21]

An Efficient Way to Learn Rules for Grapheme-to-Phoneme Conversion in Chinese

Z. Zirong, C. Min, and C. Eric, “An Efficient Way to Learn Rules for Grapheme-to-Phoneme Conversion in Chinese,” in 2002 International Symposium on Chinese Spoken Language Processing (ISCSLP), 2002, pp. 59–62

[22]

Disambiguating Effectively Chinese Polyphonic Ambiguity Based on Unify Approach

F.-L. Huang, “Disambiguating Effectively Chinese Polyphonic Ambiguity Based on Unify Approach,” in 2008 International Conference on Machine Learning and Cybernetics (ICMLC), vol. 6, 2008, pp. 3242–3246

[23]

A Bi-directional LSTM Approach for Polyphone Disambiguation in Mandarin Chinese

C. Shan, X. Lei, and K. Yao, “A Bi-directional LSTM Approach for Polyphone Disambiguation in Mandarin Chinese,” in 2017 International Symposium on Chinese Spoken Language Processing (ISCSLP), 2017

[24]

Polyphone Disambiguation Based on Maximum Entropy Model in Mandarin Grapheme-to-Phoneme Conversion

F. Z. Liu and Y. Zhou, “Polyphone Disambiguation Based on Maximum Entropy Model in Mandarin Grapheme-to-Phoneme Conversion,” Key Engineering Materials, vol. 480-481, pp. 1043–1048, 2011

[25]

Polyphonic Word Disambiguation with Machine Learning Approaches

J. Liu, W. Qu, X. Tang, Y. Zhang, and Y. Sun, “Polyphonic Word Disambiguation with Machine Learning Approaches,” in 2010 Fourth International Conference on Genetic and Evolutionary Computing (ICGEC), 2010, pp. 244–247

[26]

Joint-Sequence Models for Grapheme-to-Phoneme Conversion

M. Bisani and H. Ney, “Joint-Sequence Models for Grapheme-to-Phoneme Conversion,” Speech communication, vol. 50, no. 5, pp. 434–451, 2008

[27]

Inequality Maximum Entropy Classifier with Character Features for Polyphone Disambiguation in Mandarin TTS Systems

X. Mao, D. Yuan, J. Han, D. Huang, and H. Wang, “Inequality Maximum Entropy Classifier with Character Features for Polyphone Disambiguation in Mandarin TTS Systems,” in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007

[28]

Image-to-Image Translation with Conditional Adversarial Networks

P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-Image Translation with Conditional Adversarial Networks,” in the IEEE conference on computer vision and pattern recognition (CVPR), 2017, pp. 1125–1134

[29]

Recurrent Neural Network Based Language Model

T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent Neural Network Based Language Model,” in Eleventh annual conference of the international speech communication association (ISCA), 2010

[30]

Extensions of Recurrent Neural Network Language Model

T. Mikolov, S. Kombrink, L. Burget, J. Černocký, and S. Khudanpur, “Extensions of Recurrent Neural Network Language Model,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 5528–5531

[31]

Speech Recognition with Deep Recurrent Neural Networks

A. Graves, A.-r. Mohamed, and G. Hinton, “Speech Recognition with Deep Recurrent Neural Networks,” in 2013 IEEE international conference on acoustics, speech and signal processing (ICASSP), 2013, pp. 6645–6649

[32]

Long Short-Term Memory

S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997

[33]

On the Properties of Neural Machine Translation: Encoder-Decoder Approaches

K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, “On the Properties of Neural Machine Translation: Encoder-Decoder Approaches,” arXiv preprint arXiv:1409.1259, 2014

[34]

Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition

W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964

[35]

Bidirectional Recurrent Neural Networks

M. Schuster and K. K. Paliwal, “Bidirectional Recurrent Neural Networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997

[36]

https://www.data-baker.com/bz dyz.html

[37]

Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings

Y. Song, S. Shi, J. Li, and H. Zhang, “Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings,” in the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), vol. 2, 2018, pp. 175–180

[38]

https://ai.tencent.com/ailab/nlp/embedding.html

[39]

Dropout: A Simple Way to Prevent Neural Networks from Overfitting

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014

[40]

NLPIR: A Theoretical Framework for Applying Natural Language Processing to Information Retrieval

L. Zhou and D. Zhang, “NLPIR: A Theoretical Framework for Applying Natural Language Processing to Information Retrieval,” Journal of the American Society for Information Science and Technology, vol. 54, no. 2, pp. 115–123, 2003

[1] [1]

Polyphone Disambiguation for Mandarin Chinese Using Conditional Neural Network with Multi-level Embedding Features

Introduction The grapheme-to-phoneme (G2P) conversion is a fundamental front-end procedure in the Chinese Text-to-Speech (TTS) syn- thesis system, either the traditional HMM-based speech syn- thesis system [1, 2] or the End-to-End speech synthesis sys- tem [3, 4, 5, 6]. G2P typically generates a sequence of phones from a sequence of characters or grapheme...

work page internal anchor Pith review Pith/arXiv arXiv 1907

[2] [2]

背” can be pro- nounced as either “bei1

Chinese Polyphonic Characters Except for the monophonic characters in Mandarin Chinese, there are polyphonic characters that refer to those with more than one pronunciations. Speciﬁcally, we use a mapping func- tion to formulate the conversion from a character to its corre- sponding pronunciations. Function f is deﬁned as follows: f : C→ P (1) where C den...

work page

[3] [3]

Speciﬁcally, the polyphone disambigua- tion system converts a polyphonic character to its corresponding pinyin

Method Different from the traditional grapheme-to-phoneme (G2P) conversion, the polyphone disambiguation is considered as a classiﬁcation problem. Speciﬁcally, the polyphone disambigua- tion system converts a polyphonic character to its corresponding pinyin. Our proposed system is shown in Figure 1. In terms of the characteristics and properties of the po...

work page

[4] [4]

Experimental Results 4.1. Polyphonic Character Database For training and evaluating our proposed polyphone disam- biguation systems, we use a publicly available dataset from Bei- jing Data-Baker Science and Technology Ltd which contains 150 frequently used polyphonic characters and their 151585 corresponding sentences. We divide the corpus into a training...

work page

[5] [5]

For the prediction module, we adopt three fully connected layers with size 512, 1024 and 285 respectively

The dropout rate of LSTM is set to 0.1 to avoid overﬁtting [31]. For the prediction module, we adopt three fully connected layers with size 512, 1024 and 285 respectively. The output size 285 is equal to the number of all possible pinyins in this polyphonic character database. The activation function of the ﬁrst two fully connected layers is RELU. We use ...

work page

[6] [6]

We use the polyphonic character database described in section 4.1 for train- ing and evaluating since we do not have the personal labelled data used in [15]

which adopts two LSTM layers with size 512 and the NLPIR toolkit [32] for POS tagging on the text. We use the polyphonic character database described in section 4.1 for train- ing and evaluating since we do not have the personal labelled data used in [15]. The approach presented in [15] had compared with other polyphone disambiguation approaches and shown...

work page

[7] [7]

We explore sentence-level encoding vector as a condition as well as the word-level vector ob- tained from a pre-trained word-to-vector lookup table

Conclusions In this paper, we propose a data-driven approach using condi- tional neural network architecture for Mandarin Chinese poly- phone disambiguation. We explore sentence-level encoding vector as a condition as well as the word-level vector ob- tained from a pre-trained word-to-vector lookup table. Re- sults show that the sentence-level conditional...

work page

[8] [8]

Acknowledgments This research was funded in part by the National Natural Sci- ence Foundation of China (61773413), Natural Science Foun- dation of Guangzhou City (201707010363), Six Talent Peaks project in Jiangsu Province (JY-074), Science and Technology Program of Guangzhou City (201903010040)

work page

[9] [9]

An HMM-Based Man- darin Chinese Text-To-Speech System

Y . Qian, F. Soong, Y . Chen, and M. Chu, “An HMM-Based Man- darin Chinese Text-To-Speech System.” in 2006 International Symposium on Chinese Spoken Language Processing (ISCSLP) , 2006, pp. 223–232

work page 2006

[10] H. Zen, T. Nose, J. Yamagishi, S. Sako, T. Masuko, A. W. Black, and K. Tokuda, “The HMM-Based Speech Synthesis System (HTS) Version 2.0,” in 6th ISCA Workshop on Speech Synthesis (SSW-6), 2007, pp. 294–299.

[11] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, “Deep Voice 3: 2000-Speaker Neural Text-to-Speech,” CoRR, vol. abs/1710.07654, 2017.

[12] S. O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, and J. Raiman, “Deep Voice: Real-time Neural Text-to-Speech,” CoRR, vol. abs/1702.07825, 2017.

[13] S. Arik, G. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou, “Deep Voice 2: Multi-Speaker Neural Text-to-Speech,” CoRR, vol. abs/1705.08947, 2017.

[14] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., “Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4779–4783.

[15] K. Rao, F. Peng, H. Sak, and F. Beaufays, “Grapheme-to-Phoneme Conversion Using Long Short-Term Memory Recurrent Neural Networks,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4225–4229.

[16] L.-S. Lee, C.-Y. Tseng, and M. Ouh-Young, “The Synthesis Rules in A Chinese Text-to-Speech System,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, no. 9, pp. 1309–1320, 1989.

[17] H. Dong, J. Tao, and B. Xu, “Grapheme-to-Phoneme Conversion in Chinese TTS System,” in 2004 International Symposium on Chinese Spoken Language Processing (ISCSLP), 2004, pp. 165–168.

[18] L. Yi, L. Jian, H. Jie, and Z. Xiong, “Improved Grapheme-to-Phoneme Conversion for Mandarin TTS,” Tsinghua Science & Technology, vol. 14, no. 5, pp. 606–611, 2009.

[19] M. Rashad, H. M. El-Bakry, I. R. Isma’il, and N. Mastorakis, “An Overview of Text-to-Speech Synthesis Techniques,” Latest trends on communications and information technology, pp. 84–89, 2010.

[20] H. Zhang, J. Yu, W. Zhan, and S. Yu, “Disambiguation of Chinese Polyphonic Characters,” in The First International Workshop on MultiMedia Annotation (MMA2001), vol. 1, 2001, pp. 30–1.

[21] Z. Zirong, C. Min, and C. Eric, “An Efficient Way to Learn Rules for Grapheme-to-Phoneme Conversion in Chinese,” in 2002 International Symposium on Chinese Spoken Language Processing (ISCSLP), 2002, pp. 59–62.

[22] F.-L. Huang, “Disambiguating Effectively Chinese Polyphonic Ambiguity Based on Unify Approach,” in 2008 International Conference on Machine Learning and Cybernetics (ICMLC), vol. 6, 2008, pp. 3242–3246.

[23] C. Shan, X. Lei, and K. Yao, “A Bi-directional LSTM Approach for Polyphone Disambiguation in Mandarin Chinese,” in 2017 International Symposium on Chinese Spoken Language Processing (ISCSLP), 2017.

[24] F. Z. Liu and Y. Zhou, “Polyphone Disambiguation Based on Maximum Entropy Model in Mandarin Grapheme-to-Phoneme Conversion,” Key Engineering Materials, vol. 480–481, pp. 1043–1048, 2011.

[25] J. Liu, W. Qu, X. Tang, Y. Zhang, and Y. Sun, “Polyphonic Word Disambiguation with Machine Learning Approaches,” in 2010 Fourth International Conference on Genetic and Evolutionary Computing (ICGEC), 2010, pp. 244–247.

[26] M. Bisani and H. Ney, “Joint-Sequence Models for Grapheme-to-Phoneme Conversion,” Speech Communication, vol. 50, no. 5, pp. 434–451, 2008.

[27] X. Mao, D. Yuan, J. Han, D. Huang, and H. Wang, “Inequality Maximum Entropy Classifier with Character Features for Polyphone Disambiguation in Mandarin TTS Systems,” in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007.

[28] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-Image Translation with Conditional Adversarial Networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1125–1134.

[29] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, “Recurrent Neural Network Based Language Model,” in Eleventh Annual Conference of the International Speech Communication Association (ISCA), 2010.

[30] T. Mikolov, S. Kombrink, L. Burget, J. Černocký, and S. Khudanpur, “Extensions of Recurrent Neural Network Language Model,” in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 5528–5531.

[31] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech Recognition with Deep Recurrent Neural Networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–6649.

[32] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[33] K. Cho, B. van Merriënboer, D. Bahdanau, and Y. Bengio, “On the Properties of Neural Machine Translation: Encoder-Decoder Approaches,” arXiv preprint arXiv:1409.1259, 2014.

[34] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, Attend and Spell: A Neural Network for Large Vocabulary Conversational Speech Recognition,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964.

[35] M. Schuster and K. K. Paliwal, “Bidirectional Recurrent Neural Networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.

[36] https://www.data-baker.com/bz dyz.html

[37] Y. Song, S. Shi, J. Li, and H. Zhang, “Directional Skip-Gram: Explicitly Distinguishing Left and Right Context for Word Embeddings,” in 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), vol. 2, 2018, pp. 175–180.

[38] https://ai.tencent.com/ailab/nlp/embedding.html

[39] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[40] L. Zhou and D. Zhang, “NLPIR: A Theoretical Framework for Applying Natural Language Processing to Information Retrieval,” Journal of the American Society for Information Science and Technology, vol. 54, no. 2, pp. 115–123, 2003.