
arXiv: 1810.03430 · v1 · pith:N6WAZZ74 · submitted 2018-10-08 · cs.IR · cs.CL · cs.LG

Cross Script Hindi English NER Corpus from Wikipedia

keywords: corpora, lingual, mixed, language, script, standard, text, cross

Abstract

Text generated on social media platforms is essentially mixed-lingual. Language mixing in any form creates considerable difficulty for language processing systems. Moreover, advances in language processing research depend on the availability of standard corpora. The development of mixed-lingual Indian Named Entity Recognition (NER) systems is hindered by the unavailability of standard evaluation corpora. Such corpora may be mixed-lingual in nature, with text written in multiple languages but predominantly in a single script. The motivation of our work is to emphasize the automatic generation of such corpora in order to encourage mixed-lingual Indian NER. The paper presents the preparation of a cross-script Hindi-English corpus from Wikipedia category pages. The corpus is successfully annotated using the standard CoNLL-2003 categories PER, LOC, ORG, and MISC. It is evaluated with a variety of machine learning algorithms, and favorable results are achieved.
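
To make the annotation scheme concrete, the following is a minimal sketch of how token-level tags over the four CoNLL-2003 categories might be grouped into entity spans. It is not the authors' pipeline: the BIO prefix scheme, the function name `extract_entities`, and the romanized Hindi-English sample sentence are all assumptions introduced here for illustration.

```python
from typing import List, Tuple

# The four CoNLL-2003 entity categories named in the abstract.
CATEGORIES = {"PER", "LOC", "ORG", "MISC"}


def extract_entities(tagged: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, category) spans.

    The BIO scheme (B- opens a span, I- continues it, O is outside)
    is an assumption; the abstract only names the categories.
    """
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in tagged:
        prefix, _, cat = tag.partition("-")
        if prefix == "B" and cat in CATEGORIES:
            # A new span starts; flush any span still open.
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [token], cat
        elif prefix == "I" and tokens and cat == label:
            # Continue the currently open span.
            tokens.append(token)
        else:
            # "O" or an inconsistent tag closes the open span.
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


# Hypothetical sentence in Roman script mixing Hindi and English tokens,
# invented for this example and not taken from the corpus.
sample = [
    ("Sachin", "B-PER"), ("Tendulkar", "I-PER"), ("ka", "O"), ("janm", "O"),
    ("Mumbai", "B-LOC"), ("mein", "O"), ("hua", "O"), ("tha", "O"),
]
print(extract_entities(sample))
```

Under these assumptions the sample yields one PER span and one LOC span; the actual corpus format and tagging scheme should be checked against the paper itself.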

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models across Modalities

    cs.CL 2025-10 accept novelty 7.0

    A comprehensive survey of code-switched NLP research with LLMs across modalities, covering 327 studies, 15+ tasks, 30+ datasets, and 80+ languages while outlining challenges and a future roadmap.