
arxiv: 1709.00387 · v1 · pith:MCCUZ6STnew · submitted 2017-08-28 · 💻 cs.CL · cs.LG · cs.SD

MIT-QCRI Arabic Dialect Identification System for the 2017 Multi-Genre Broadcast Challenge

keywords: arabic, system, challenge, dialect, dialects, domain, mgb-3, broadcast

In order to successfully annotate the Arabic speech content found in open-domain media broadcasts, it is essential to be able to process a diverse set of Arabic dialects. For the 2017 Multi-Genre Broadcast challenge (MGB-3) there were two possible tasks: Arabic speech recognition and Arabic Dialect Identification (ADI). In this paper, we describe our efforts to create an ADI system for the MGB-3 challenge, with the goal of distinguishing among four major Arabic dialects, as well as Modern Standard Arabic. Our research focused on dialect variability and domain mismatches between the training and test domains. In order to achieve a robust ADI system, we explored both Siamese neural network models to learn similarities and dissimilarities among Arabic dialects, and i-vector post-processing to adapt to domain mismatches. Both acoustic and linguistic features were used for the final MGB-3 submissions, with the best primary system achieving 75% accuracy on the official 10hr test set.
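As a rough illustration of the Siamese idea mentioned in the abstract, the sketch below passes two input vectors through the same shared-weight projection and scores the pair with a cosine-based contrastive loss. The abstract does not specify the network or the loss, so everything here is an assumption for illustration: the single linear layer, the `margin` parameter, and the function names `embed`, `cosine_similarity`, and `siamese_pair_loss` are hypothetical and are not taken from the paper.

```python
import math


def embed(weights, x):
    # Hypothetical one-layer linear projection. The same `weights` are
    # applied to both members of a pair, which is what makes it "Siamese".
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def cosine_similarity(a, b):
    # Assumes both vectors are non-zero.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def siamese_pair_loss(weights, x_a, x_b, same_dialect, margin=0.0):
    # Illustrative contrastive loss (an assumption, not the paper's loss):
    # same-dialect pairs are pulled toward similarity 1,
    # different-dialect pairs are penalized only above `margin`.
    sim = cosine_similarity(embed(weights, x_a), embed(weights, x_b))
    if same_dialect:
        return 1.0 - sim
    return max(0.0, sim - margin)


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(siamese_pair_loss(identity, [1.0, 0.0], [1.0, 0.0], same_dialect=True))
    print(siamese_pair_loss(identity, [1.0, 0.0], [1.0, 0.0], same_dialect=False))
```

In a real system the inputs would be utterance-level representations such as i-vectors, and the shared weights would be trained by minimizing this loss over many labeled pairs.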
