Normalization of Non-Standard Words in Croatian Texts

Miran Pobar; Sanda Martinčić-Ipšić; Slobodan Beliga

arxiv: 1503.08167 · v2 · pith:LLGUQSMFnew · submitted 2015-03-27 · 💻 cs.CL



keywords: normalization, words, croatian, non-standard, expanded, form, methods, text


This paper presents text normalization, an integral part of any text-to-speech synthesis system. Text normalization is a set of methods whose task is to write non-standard words, such as numbers, dates, times, abbreviations, acronyms and the most common symbols, in their full expanded form. A complete taxonomy for the classification of non-standard words in the Croatian language is proposed, together with rule-based normalization methods combined with a lookup dictionary. The achieved token rate for normalization of Croatian texts is 95%, and 80% of the expanded words are in the correct morphological form.
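To make the idea of rules combined with a lookup dictionary concrete, here is a minimal Python sketch. It is not the authors' implementation: the dictionary entries, the category names, and the helper functions `classify`, `expand` and `normalize` are all hypothetical, and numbers are read digit by digit as a deliberate simplification. It also ignores the morphological agreement (case, gender, number) that the abstract reports as the harder part of the task.

```python
import re

# Hypothetical lookup dictionary; a real system would use a much larger one.
ABBREVIATIONS = {
    "npr.": "na primjer",
    "itd.": "i tako dalje",
    "tj.": "to jest",
    "dr.": "doktor",
}
SYMBOLS = {"%": "posto"}

# Croatian names of the digits 0-9, used for a simplified digit-by-digit reading.
DIGITS = ["nula", "jedan", "dva", "tri", "četiri",
          "pet", "šest", "sedam", "osam", "devet"]


def classify(token: str) -> str:
    """Assign a token to a coarse (assumed) non-standard-word category."""
    if token.lower() in ABBREVIATIONS:
        return "ABBR"
    if token in SYMBOLS:
        return "SYMBOL"
    if re.fullmatch(r"[0-9]+", token):
        return "NUMBER"
    return "WORD"


def expand(token: str) -> str:
    """Expand one token according to its category; ordinary words pass through."""
    kind = classify(token)
    if kind == "ABBR":
        return ABBREVIATIONS[token.lower()]
    if kind == "SYMBOL":
        return SYMBOLS[token]
    if kind == "NUMBER":
        # Simplification: spell each digit separately, with no inflection.
        return " ".join(DIGITS[int(d)] for d in token)
    return token


def normalize(text: str) -> str:
    """Tokenize on digit runs, the percent sign, and whitespace, then expand."""
    tokens = re.findall(r"[0-9]+|%|\S+", text)
    return " ".join(expand(t) for t in tokens)
```

For example, `normalize("npr. 5 %")` yields `"na primjer pet posto"`. Dates, times, acronyms and multi-digit number names would each need their own rules, and choosing the correct inflected form requires information about the surrounding words that this sketch does not model.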
