Sri Lanka Document Datasets: A Large-Scale, Multilingual Resource for Law, News, and Policy
Pith reviewed 2026-05-21 21:56 UTC · model grok-4.3
The pith
A collection of 269,194 machine-readable Sri Lankan documents in Sinhala, Tamil, and English is now openly available across 26 datasets for law, news, and policy research.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors have assembled and released 26 datasets containing 269,194 machine-readable documents from Sri Lanka covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics, with text in Sinhala, Tamil, and English. The datasets are updated daily and provided with descriptions of their sources, collection methods, formats, and licensing details to enable research in computational linguistics, legal analytics, and related fields.
What carries the argument
The automated collection pipeline that retrieves documents from official sources, cleans and formats them into consistent machine-readable structures, and maintains daily updates.
If this is right
- Researchers can train and evaluate multilingual natural language processing models on authentic Sri Lankan legal and governmental texts.
- Legal analytics work can process and compare large volumes of court judgments and parliamentary records from Sri Lanka.
- Socio-political studies gain structured access to government publications and news for tracking policy and public discourse over time.
- Daily updates allow ongoing research to incorporate the most recent official documents without manual re-collection.
Where Pith is reading between the lines
- Comparable pipelines could be applied to official records in other multilingual countries to create similar research resources.
- The datasets could be paired with existing translation models to test performance specifically on Sinhala and Tamil legal language.
- Cross-referencing these documents with international legal databases might reveal patterns in how Sri Lankan policy aligns with regional standards.
Load-bearing premise
The automated collection pipeline correctly retrieves, cleans, and formats the source documents without introducing transcription errors or selection biases, and redistribution complies with applicable licensing and ethical rules.
What would settle it
A random sample check against original sources that finds frequent transcription errors, missing sections, or inconsistent formatting in the released documents would undermine the claim that the datasets form a reliable machine-readable resource.
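A check of this kind can be sketched in a few lines. The example below is a minimal illustration, not the paper's method: it assumes a reviewer already holds trusted reference transcriptions for some documents, and the names `audit_sample`, `character_error_rate`, and the 2% threshold are hypothetical choices made for this sketch.

```python
import random


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_error_rate(released: str, reference: str) -> float:
    """Edit distance divided by reference length.

    Note: Python compares Unicode code points, not grapheme clusters, so
    Sinhala and Tamil combining marks are counted as separate characters.
    """
    if not reference:
        return 0.0 if not released else 1.0
    return levenshtein(released, reference) / len(reference)


def audit_sample(released, reference, sample_size, cer_threshold=0.02, seed=0):
    """Score a reproducible random sample of documents present in both mappings.

    `released` and `reference` map document IDs to text. The threshold is an
    assumption for illustration only.
    """
    rng = random.Random(seed)
    ids = sorted(set(released) & set(reference))
    chosen = rng.sample(ids, min(sample_size, len(ids)))
    scores = {doc_id: character_error_rate(released[doc_id], reference[doc_id])
              for doc_id in chosen}
    flagged = sorted(doc_id for doc_id, cer in scores.items() if cer > cer_threshold)
    return scores, flagged
```

A high share of flagged documents in such a sample would count against the reliability claim; a low share would support it only to the extent that the sample is representative of all 26 datasets.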
Original abstract
We present a collection of open, machine-readable document datasets covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics from Sri Lanka. The collection currently comprises of 269,194 documents (79.5 GB) across 26 datasets in Sinhala, Tamil, and English. The datasets are updated daily and mirrored on GitHub and Hugging Face. These resources aim to support research in computational linguistics, legal analytics, socio-political studies, and multilingual natural language processing. We describe the data sources, collection pipeline, formats, and potential use cases, while discussing licensing and ethical considerations. This manuscript is at version v2026-05-15-0811.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a collection of 26 open, machine-readable document datasets from Sri Lanka covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics. The collection totals 269,194 documents (79.5 GB) in Sinhala, Tamil, and English, with daily updates and public mirrors on GitHub and Hugging Face. The paper describes the data sources, collection pipeline, formats, potential use cases in computational linguistics, legal analytics, and multilingual NLP, and discusses licensing and ethical considerations.
Significance. If the collection process is accurate and licensing is compliant, this release would be significant as a large-scale multilingual resource for low-resource languages (Sinhala and Tamil) and for domain-specific research in legal and policy documents. The scale, daily updates, and standard hosting platforms support reproducibility and broad accessibility. Open release of such data fills a notable gap and can enable new work in multilingual models and socio-political studies.
major comments (2)
- [Collection Pipeline] Collection Pipeline section: the description of automated retrieval, cleaning, and formatting does not include explicit validation steps (e.g., sample audits against original sources or checks for OCR/transcription errors). This detail is needed to support the central claim that the released documents are reliably machine-readable.
- [Licensing and Ethical Considerations] Licensing and Ethical Considerations section: licensing is discussed at a high level but no per-dataset license table or explicit redistribution permissions are provided for the 26 datasets. This information is load-bearing for confirming that the open release complies with source terms.
minor comments (2)
- [Abstract] Abstract: 'comprises of' should be revised to 'comprises'.
- [Data Sources] Data Sources section: a summary table breaking down document counts and sizes by dataset and language would improve clarity and allow readers to assess balance.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation of minor revision. The comments identify useful opportunities to strengthen the manuscript's description of the collection process and licensing compliance. We address each major comment below, with revisions incorporated where appropriate.
Point-by-point responses
- Referee: [Collection Pipeline] Collection Pipeline section: the description of automated retrieval, cleaning, and formatting does not include explicit validation steps (e.g., sample audits against original sources or checks for OCR/transcription errors). This detail is needed to support the central claim that the released documents are reliably machine-readable.
  Authors: We agree that explicit validation details would better support the claim of reliable machine-readability. The revised manuscript expands the Collection Pipeline section with a new paragraph describing our validation procedures: periodic random sampling of 1% of documents for manual comparison against original source PDFs, automated scripts to detect incomplete text extraction, and targeted post-OCR correction for Sinhala and Tamil scripts using language-specific dictionaries. These steps are performed during both initial collection and daily updates.
  Revision: yes
- Referee: [Licensing and Ethical Considerations] Licensing and Ethical Considerations section: licensing is discussed at a high level but no per-dataset license table or explicit redistribution permissions are provided for the 26 datasets. This information is load-bearing for confirming that the open release complies with source terms.
  Authors: We acknowledge the need for greater transparency on a per-dataset basis. The revised manuscript now includes a dedicated table in the Licensing and Ethical Considerations section listing all 26 datasets, their sources, original terms of use, and our determination that redistribution is permitted (primarily under Sri Lankan government open data policies or news outlet licenses allowing non-commercial reuse). This table confirms compliance with source requirements.
  Revision: yes
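The first response above mentions 1% random sampling and automated detection of incomplete text extraction, applied at initial collection and again on daily updates. The authors' scripts are not shown on this page, so the following is only a sketch of one way such checks could be made repeatable. Every name in it (`in_audit_sample`, `looks_incomplete`), the SHA-256 bucketing scheme, and the 200-characters-per-page threshold are assumptions for illustration.

```python
import hashlib


def in_audit_sample(doc_id: str, rate: float = 0.01) -> bool:
    """Decide deterministically whether a document falls in the audit sample.

    Hashing the ID (an assumed stable identifier) means a document's
    membership does not change when new documents arrive in a daily update.
    """
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def looks_incomplete(text: str, page_count: int, min_chars_per_page: int = 200) -> bool:
    """Heuristic flag: extracted text is suspiciously short for the page count.

    The per-page threshold is a hypothetical value and would need tuning
    per dataset and per script.
    """
    if page_count <= 0:
        return True
    return len(text.strip()) < page_count * min_chars_per_page
```

Hash-based selection is used here instead of a fresh random draw so that the audited subset stays stable across runs; a pipeline that wants a different sample each period could mix a date string into the hashed value.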
Circularity Check
No significant circularity; dataset release paper with no derivation chain
Full rationale
This is a data release paper describing sources, collection pipeline, formats, and licensing for 26 Sri Lankan document datasets. It contains no equations, fitted parameters, predictions, or mathematical claims. The central contribution is the existence and accessibility of the released resources (mirrored on GitHub and Hugging Face), which are externally verifiable by direct inspection of the cited public sources and mirrors. No self-citation is load-bearing for any derivation, and no step reduces to its own inputs by construction. The paper is self-contained as a descriptive resource announcement.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  We present a collection of open, machine-readable document datasets covering parliamentary proceedings, legal judgments, government publications, news, and tourism statistics from Sri Lanka.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Cross-Temporal Sinhala OCR: Page-Level Adaptation and Diachronic Analysis
  New Sinhala OCR dataset from 1981-2019 legislative acts enables LightOnOCR-2-1B to reach 1.05% CER, beating Surya-OCR, Tesseract, and Google Document AI.