Shamela: A Large-Scale Historical Arabic Corpus

Alexander Magidow; Avi Shmidman; Maxim Romanov; Moshe Koppel; Yonatan Belinkov

arxiv: 1612.08989 · v1 · pith:LC56V2FHnew · submitted 2016-12-28 · 💻 cs.CL


Yonatan Belinkov, Alexander Magidow, Maxim Romanov, Avi Shmidman, Moshe Koppel

classification 💻 cs.CL

keywords Arabic · corpus · historical · large-scale · analyzer · application · automatically · billion


Arabic is a widely spoken language with a rich and long history spanning more than fourteen centuries. Yet existing Arabic corpora largely focus on the modern period or lack sufficient diachronic information. We develop a large-scale historical corpus of Arabic of about 1 billion words from diverse periods of time. We clean this corpus, process it with a morphological analyzer, and enhance it by detecting parallel passages and automatically dating undated texts. We demonstrate its utility with selected case studies in which we show its application to the digital humanities.
