Sri Lanka Document Datasets: A Large-Scale, Multilingual Resource for Law, News, and Policy

Nuwan I. Senaratna

arxiv: 2510.04124 · v7 · pith:HYR3DQTKnew · submitted 2025-10-05 · 💻 cs.CL

Pith reviewed 2026-05-21 21:56 UTC · model grok-4.3

keywords Sri Lanka · document datasets · multilingual resources · parliamentary proceedings · legal judgments · government publications · natural language processing · Sinhala Tamil English

The pith

A collection of 269,194 machine-readable Sri Lankan documents in Sinhala, Tamil, and English is now openly available across 26 datasets for law, news, and policy research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a large set of open datasets drawn from Sri Lankan parliamentary proceedings, legal judgments, government publications, news, and tourism statistics. These materials total 269,194 documents and 79.5 gigabytes, distributed across 26 datasets in three languages. The collection is gathered through an automated pipeline, updated daily, and mirrored on public platforms to support computational linguistics, legal analytics, and socio-political studies. A sympathetic reader would care because the resources address the scarcity of high-quality, multilingual official documents from a region often underrepresented in existing research collections.

Core claim

The author has assembled and released 26 datasets containing 269,194 machine-readable documents from Sri Lanka covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics, in Sinhala, Tamil, and English. The datasets are updated daily and come with descriptions of their sources, collection methods, formats, and licensing details to enable research in computational linguistics, legal analytics, and related fields.

What carries the argument

The argument rests on the automated collection pipeline, which retrieves documents from official sources, cleans and formats them into consistent machine-readable structures, and maintains daily updates.

If this is right

Researchers can train and evaluate multilingual natural language processing models on authentic Sri Lankan legal and governmental texts.
Legal analytics work can process and compare large volumes of court judgments and parliamentary records from Sri Lanka.
Socio-political studies gain structured access to government publications and news for tracking policy and public discourse over time.
Daily updates allow ongoing research to incorporate the most recent official documents without manual re-collection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Comparable pipelines could be applied to official records in other multilingual countries to create similar research resources.
The datasets could be paired with existing translation models to test performance specifically on Sinhala and Tamil legal language.
Cross-referencing these documents with international legal databases might reveal patterns in how Sri Lankan policy aligns with regional standards.

Load-bearing premise

The automated collection pipeline correctly retrieves, cleans, and formats the source documents without introducing transcription errors or selection biases, and redistribution complies with applicable licensing and ethical rules.

What would settle it

A random sample check against original sources that finds frequent transcription errors, missing sections, or inconsistent formatting in the released documents would undermine the claim that the datasets form a reliable machine-readable resource.
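
A check of this kind can be sketched in a few lines. The example below draws a reproducible random sample of document IDs and turns the number of documents that fail manual comparison into a confidence interval for the collection-wide failure rate. It is a minimal sketch under stated assumptions: the function names, the 1% sampling fraction, the fixed seed, and the integer IDs standing in for real documents are all illustrative and do not come from the paper.

```python
import math
import random


def draw_audit_sample(doc_ids, fraction=0.01, seed=0):
    """Pick a reproducible simple random sample of document IDs to audit."""
    ids = list(doc_ids)
    k = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, k)


def wilson_interval(failures, n, z=1.96):
    """Wilson score interval (about 95% for z=1.96) for the true failure rate."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical integer IDs standing in for the 269,194 released documents.
sample = draw_audit_sample(range(269_194), fraction=0.01)

# Suppose manual comparison against the sources found 27 faulty documents
# (an invented number, used only to show the calculation).
low, high = wilson_interval(failures=27, n=len(sample))
print(f"audited {len(sample)} documents, failure rate in [{low:.4f}, {high:.4f}]")
```

The Wilson interval is used here instead of the plain normal approximation because it behaves sensibly when the observed failure count is small or zero, which is the hoped-for outcome of such an audit.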

Original abstract

We present a collection of open, machine-readable document datasets covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics from Sri Lanka. The collection currently comprises of 269,194 documents (79.5 GB) across 26 datasets in Sinhala, Tamil, and English. The datasets are updated daily and mirrored on GitHub and Hugging Face. These resources aim to support research in computational linguistics, legal analytics, socio-political studies, and multilingual natural language processing. We describe the data sources, collection pipeline, formats, and potential use cases, while discussing licensing and ethical considerations. This manuscript is at version v2026-05-15-0811.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a collection of 26 open, machine-readable document datasets from Sri Lanka covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics. The collection totals 269,194 documents (79.5 GB) in Sinhala, Tamil, and English, with daily updates and public mirrors on GitHub and Hugging Face. The paper describes the data sources, collection pipeline, formats, potential use cases in computational linguistics, legal analytics, and multilingual NLP, and discusses licensing and ethical considerations.

Significance. If the collection process is accurate and licensing is compliant, this release would be significant as a large-scale multilingual resource for low-resource languages (Sinhala and Tamil) and for domain-specific research in legal and policy documents. The scale, daily updates, and standard hosting platforms support reproducibility and broad accessibility. Open release of such data fills a notable gap and can enable new work in multilingual models and socio-political studies.

major comments (2)

[Collection Pipeline] Collection Pipeline section: the description of automated retrieval, cleaning, and formatting does not include explicit validation steps (e.g., sample audits against original sources or checks for OCR/transcription errors). This detail is needed to support the central claim that the released documents are reliably machine-readable.
[Licensing and Ethical Considerations] Licensing and Ethical Considerations section: licensing is discussed at a high level but no per-dataset license table or explicit redistribution permissions are provided for the 26 datasets. This information is load-bearing for confirming that the open release complies with source terms.

minor comments (2)

[Abstract] Abstract: 'comprises of' should be revised to 'comprises'.
[Data Sources] Data Sources section: a summary table breaking down document counts and sizes by dataset and language would improve clarity and allow readers to assess balance.
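
The summary table requested in the last minor comment amounts to a small aggregation step. The sketch below groups per-document records by dataset and language and prints document counts and sizes. The record layout, the dataset names, and the sizes are hypothetical placeholders, not values from the paper; the language codes `si`, `ta`, and `en` are the ISO 639-1 codes for Sinhala, Tamil, and English.

```python
from collections import defaultdict


def summarise(records):
    """Aggregate (dataset, language, size_bytes) records per dataset and language."""
    table = defaultdict(lambda: {"docs": 0, "bytes": 0})
    for dataset, language, size_bytes in records:
        cell = table[(dataset, language)]
        cell["docs"] += 1
        cell["bytes"] += size_bytes
    return dict(table)


def render(table):
    """Format the aggregated table as fixed-width plain text."""
    lines = [f"{'dataset':<20}{'lang':<6}{'docs':>8}{'MB':>10}"]
    for (dataset, language), cell in sorted(table.items()):
        megabytes = cell["bytes"] / 1e6
        lines.append(f"{dataset:<20}{language:<6}{cell['docs']:>8}{megabytes:>10.1f}")
    return "\n".join(lines)


# Hypothetical records; real ones would come from the datasets' own metadata.
records = [
    ("example_acts", "si", 1_200_000),
    ("example_acts", "si", 800_000),
    ("example_acts", "ta", 950_000),
    ("example_news", "en", 40_000),
]
print(render(summarise(records)))
```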

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. The comments identify useful opportunities to strengthen the manuscript's description of the collection process and licensing compliance. We address each major comment below, with revisions incorporated where appropriate.

Point-by-point responses

Referee: [Collection Pipeline] Collection Pipeline section: the description of automated retrieval, cleaning, and formatting does not include explicit validation steps (e.g., sample audits against original sources or checks for OCR/transcription errors). This detail is needed to support the central claim that the released documents are reliably machine-readable.

Authors: We agree that explicit validation details would better support the claim of reliable machine-readability. The revised manuscript expands the Collection Pipeline section with a new paragraph describing our validation procedures: periodic random sampling of 1% of documents for manual comparison against original source PDFs, automated scripts to detect incomplete text extraction, and targeted post-OCR correction for Sinhala and Tamil scripts using language-specific dictionaries. These steps are performed during both initial collection and daily updates.

revision: yes

Referee: [Licensing and Ethical Considerations] Licensing and Ethical Considerations section: licensing is discussed at a high level but no per-dataset license table or explicit redistribution permissions are provided for the 26 datasets. This information is load-bearing for confirming that the open release complies with source terms.

Authors: We acknowledge the need for greater transparency on a per-dataset basis. The revised manuscript now includes a dedicated table in the Licensing and Ethical Considerations section listing all 26 datasets, their sources, original terms of use, and our determination that redistribution is permitted (primarily under Sri Lankan government open data policies or news outlet licenses allowing non-commercial reuse). This table confirms compliance with source requirements.

revision: yes
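
The automated check for incomplete text extraction mentioned in the first response could take a form like the one below. This is a minimal sketch and not the author's scripts: the function names and all thresholds (200 characters per page, 1% replacement characters, 50% script share) are assumptions chosen for illustration. The only fixed facts are the Unicode block ranges for Sinhala (U+0D80 to U+0DFF) and Tamil (U+0B80 to U+0BFF).

```python
SINHALA = (0x0D80, 0x0DFF)  # Unicode block for Sinhala
TAMIL = (0x0B80, 0x0BFF)    # Unicode block for Tamil


def script_share(text, low, high):
    """Fraction of alphabetic characters whose code point lies in [low, high]."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum(low <= ord(c) <= high for c in letters) / len(letters)


def extraction_flags(text, n_pages, expected_script=None,
                     min_chars_per_page=200, min_script_share=0.5):
    """Return a list of heuristic warnings for one extracted document."""
    stripped = text.strip()
    if not stripped:
        return ["empty"]
    flags = []
    # Far less text than the page count suggests usually means pages were skipped.
    if len(stripped) < min_chars_per_page * n_pages:
        flags.append("too_short")
    # U+FFFD marks bytes that could not be decoded.
    if stripped.count("\ufffd") / len(stripped) > 0.01:
        flags.append("replacement_chars")
    # A Sinhala or Tamil document should be mostly in its own script.
    if expected_script is not None:
        if script_share(stripped, *expected_script) < min_script_share:
            flags.append("wrong_script")
    return flags


print(extraction_flags("", n_pages=3))
print(extraction_flags("hello world " * 50, n_pages=1, expected_script=SINHALA))
```

Heuristics like these only flag candidates for inspection; they cannot confirm that a document is correct, which is why the response pairs them with manual sampling.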

Circularity Check

0 steps flagged

No significant circularity; dataset release paper with no derivation chain

full rationale

This is a data release paper describing sources, collection pipeline, formats, and licensing for 26 Sri Lankan document datasets. It contains no equations, fitted parameters, predictions, or mathematical claims. The central contribution is the existence and accessibility of the released resources (mirrored on GitHub and Hugging Face), which are externally verifiable by direct inspection of the cited public sources and mirrors. No self-citation is load-bearing for any derivation, and no step reduces to its own inputs by construction. The paper is self-contained as a descriptive resource announcement.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset release paper with no mathematical derivations. The main implicit assumptions concern the public availability of source documents and the legality of their collection and redistribution, which the abstract indicates are discussed but not formalized as axioms.

pith-pipeline@v0.9.0 · 5641 in / 1257 out tokens · 44218 ms · 2026-05-21T21:56:38.905719+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
Relation between the paper passage and the cited Recognition theorem.

We present a collection of open, machine-readable document datasets covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics from Sri Lanka.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Cross-Temporal Sinhala OCR: Page-Level Adaptation and Diachronic Analysis
cs.CL · 2026-06 · unverdicted · novelty 6.0

New Sinhala OCR dataset from 1981-2019 legislative acts enables LightOnOCR-2-1B to reach 1.05% CER, beating Surya-OCR, Tesseract, and Google Document AI.
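
For readers unfamiliar with the metric, character error rate (CER) is conventionally the edit distance between the OCR output and the reference transcription, divided by the reference length. The sketch below uses that conventional definition over Unicode code points; whether the citing paper counts code points or grapheme clusters is not stated here, so treat that choice as an assumption.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)


print(cer("kitten", "sitting"))  # 3 edits over 6 reference characters
```

Because Sinhala combines base consonants with vowel signs, a code-point CER and a grapheme-cluster CER can differ for the same pair of strings.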