Text extraction. Once the raw dumps are converted into line-oriented JSON (JSONL) files, each page is processed in batches to extract usable text and metadata.
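The batch step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `iter_batches` and `extract_page`, the record fields (`id`, `title`, `text`), and the whitespace normalization are all assumptions made for illustration.

```python
import json
from itertools import islice
from typing import Iterable, Iterator


def iter_batches(lines: Iterable[str], batch_size: int) -> Iterator[list[dict]]:
    """Parse JSONL lines lazily and yield lists of at most batch_size records."""
    # Blank lines are skipped; every other line must be one JSON object.
    records = (json.loads(line) for line in lines if line.strip())
    while True:
        batch = list(islice(records, batch_size))
        if not batch:
            return
        yield batch


def extract_page(record: dict) -> dict:
    """Keep a few metadata fields and collapse whitespace in the text.

    The field names "id", "title" and "text" are hypothetical; a real dump
    may use a different schema.
    """
    return {
        "id": record.get("id"),
        "title": (record.get("title") or "").strip(),
        "text": " ".join((record.get("text") or "").split()),
    }
```

A file handle can be passed directly as `lines`, for example `for batch in iter_batches(open(path, encoding="utf-8"), 1000)`, so the whole dump never has to be held in memory at once.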

Methodology 3

1 Pith paper cites this work. Polarity classification is still indexing.

representative citing papers

Wiki Dumps to Training Corpora: South Slavic Case

cs.CL · 2026-04-28 · unverdicted · novelty 4.0 · 2 refs

A two-phase pipeline extracts clean text from Wikimedia dumps and applies n-gram filtering to remove repetitive low-quality articles for South Slavic language corpora.
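As a rough illustration of what n-gram filtering for repetitive articles could look like, the sketch below scores a text by the share of its word n-grams that occur more than once. The summary above does not specify the measure, so the scoring rule, the value of `n`, the `threshold`, and the function names are all assumptions rather than the citing paper's method.

```python
from collections import Counter


def repeated_ngram_ratio(text: str, n: int = 3) -> float:
    """Return the fraction of word n-grams that occur more than once."""
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        # Text shorter than n tokens: nothing to measure.
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def is_repetitive(text: str, n: int = 3, threshold: float = 0.5) -> bool:
    """Flag a text whose repeated n-gram share exceeds a hypothetical threshold."""
    return repeated_ngram_ratio(text, n) > threshold
```

In practice the threshold would have to be tuned per language and corpus; the default here is only a placeholder.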

citing papers explorer

Showing 1 of 1 citing paper.

Wiki Dumps to Training Corpora: South Slavic Case

cs.CL · 2026-04-28 · unverdicted · none · ref 3 · 2 links
