Linguistic complexity: English vs. Polish, text vs. corpus

Adam Orczyk; Jaroslaw Kwapien; Stanislaw Drozdz

arxiv: 1007.0936 · v1 · submitted 2010-07-06 · 💻 cs.CL · physics.soc-ph




keywords: corpus, texts, polish, scale-invariant, consisting, regime, basic, broken



We analyze the rank-frequency distributions of words in selected English and Polish texts. We show that for the lemmatized (basic) word forms the scale-invariant regime breaks after about two decades, while for the inflected word forms it may hold over the whole range of ranks. We also find that for a corpus consisting of texts written by different authors the basic scale-invariant regime is broken more strongly than for a comparable corpus consisting of texts written by the same author. Similarly, for a corpus consisting of texts translated into Polish from other languages the scale-invariant regime is broken more strongly than for a comparable corpus of native Polish texts. Moreover, we find that if the words are tagged with their proper part of speech, only verbs show a rank-frequency distribution that is almost scale-invariant.
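As a rough illustration of the kind of measurement the abstract describes, the sketch below builds a rank-frequency distribution from raw text and estimates the slope of the distribution on a log-log scale. It is only a minimal sketch under stated assumptions: the function names `rank_frequency` and `loglog_slope`, the regex tokenizer, and the ordinary least-squares fit are illustrative choices, not the authors' procedure, and no lemmatization or part-of-speech tagging is performed.

```python
from collections import Counter
import math
import re


def rank_frequency(text):
    """Return [(rank, frequency), ...] with rank 1 for the most frequent word.

    Assumption: words are maximal runs of Unicode letters, lowercased.
    This treats inflected forms as distinct words (no lemmatization).
    """
    words = re.findall(r"[^\W\d_]+", text.lower())
    freqs = sorted(Counter(words).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(pairs, max_rank=None):
    """Least-squares slope of log10(frequency) against log10(rank).

    A scale-invariant (power-law) regime appears as a straight line on a
    log-log plot; the slope is the negative of the scaling exponent.
    max_rank optionally restricts the fit to the leading ranks.
    """
    pts = [
        (math.log10(r), math.log10(f))
        for r, f in pairs
        if max_rank is None or r <= max_rank
    ]
    if len(pts) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return sxy / sxx
```

Comparing the slope fitted over the leading ranks only (for example `max_rank=100`, an illustrative cutoff) with the slope fitted over all ranks gives a crude indication of whether a single scaling regime holds throughout or breaks at higher ranks.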


