pith. sign in

arxiv: 2606.23144 · v1 · pith:4BS74JXMnew · submitted 2026-06-22 · 💻 cs.CV · cs.CL

Koshur Pixel: a large-scale synthetic ocr dataset for kashmiri

Pith reviewed 2026-06-26 09:17 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords synthetic OCR datasetKashmiri languageNastaliq scriptdata augmentationlow-resource languagesoptical character recognitionPerso-Arabic script
0
0 comments X

The pith

Koshur Pixel supplies the first large-scale synthetic OCR dataset for Kashmiri with 613,078 image-text pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Koshur Pixel to overcome the absence of annotated data that blocks optical character recognition for Kashmiri. Kashmiri text in Nastaliq script features contextual shaping, dense ligatures, and variability that demand large training sets. The authors generate the pairs from the KS-PRET-5M text corpus by rendering across fonts and layouts then applying more than 25 augmentations that mimic document wear and imaging artifacts. This synthetic route supplies a scalable substitute for hand-labeled data and creates a base resource for model training. If the generated examples transfer to real documents, the dataset supports digitization of Kashmiri texts and further language-technology work for this under-resourced language.

Core claim

Koshur Pixel is a collection of 613,078 synthetic image-text pairs for Kashmiri, produced from the KS-PRET-5M corpus through the SynthOCR-Gen framework; the pairs cover multiple fonts, range from single words to full-page layouts, and include more than 25 augmentation strategies that emulate real-world degradations.

What carries the argument

SynthOCR-Gen framework that renders text from a large Kashmiri corpus into images across fonts and granularities while applying degradation augmentations.

If this is right

  • Supplies a cost-effective substitute for manual annotation when building Kashmiri OCR systems.
  • Enables training of models that can digitize existing Kashmiri textual heritage.
  • Provides a foundation for language technologies serving a severely under-resourced language.
  • Covers multiple fonts and document scales from isolated words to complete pages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same generation pipeline could be reused for other low-resource languages that employ Nastaliq or related Perso-Arabic scripts.
  • Models trained only on the synthetic set would probably benefit from later fine-tuning on even modest amounts of real scanned data.
  • Direct comparison of error rates between synthetic-only training and mixed synthetic-plus-real training would quantify the dataset's practical value.

Load-bearing premise

Synthetic images produced with the chosen fonts and augmentations are representative enough of real Kashmiri documents to train effective OCR models.

What would settle it

Train an OCR model on Koshur Pixel and measure its character or word error rate on a separate collection of scanned real-world Kashmiri documents.

Figures

Figures reproduced from arXiv: 2606.23144 by Faizan Iqbal, Haq Nawaz Malik, Nahfid Nissar.

Figure 1
Figure 1. Figure 1: Sentence-level renders demonstrating the cursive flow of Nastaliq and varying background textures. (a) Paragraph Sample 1 (b) Paragraph Sample 2 (c) Paragraph Sample 3 (d) Paragraph Sample 4 (e) Paragraph Sample 5 (f) Paragraph Sample 6 (g) Paragraph Sample 7 (h) Paragraph Sample 8 [PITH_FULL_IMAGE:figures/full_fig_p015_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Paragraph-level renders showing multi-line complexity and simulated aging effects. (a) Page Sample 1 (b) Page Sample 2 (c) Page Sample 3 (d) Page Sample 4 (e) Page Sample 5 (f) Page Sample 6 (g) Page Sample 7 (h) Page Sample 8 [PITH_FULL_IMAGE:figures/full_fig_p015_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Full-page renders simulating complete document layouts and varying font weights. 15 [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗
read the original abstract

Optical Character Recognition (OCR) for low-resource languages is often constrained by the lack of annotated training data and the complexity of script-specific rendering. Kashmiri, written primarily in the Perso-Arabic Nastaliq script, presents additional challenges due to contextual glyph shaping, dense ligatures, and orthographic variability. We introduce Koshur Pixel, the first large-scale synthetic OCR dataset for Kashmiri, comprising 613,078 image-text pairs generated from the KS-PRET-5M corpus using the SynthOCR-Gen framework. The dataset spans multiple fonts and textual granularities, ranging from individual words to full-page documents, and incorporates more than 25 augmentation strategies that emulate real-world document degradations. Koshur Pixel provides a scalable and cost-effective alternative to manual annotation, establishing a foundational resource for training OCR systems, digitizing Kashmiri textual heritage, and advancing language technologies for a severely under-resourced language.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces Koshur Pixel as the first large-scale synthetic OCR dataset for Kashmiri, comprising 613,078 image-text pairs generated from the KS-PRET-5M corpus using the SynthOCR-Gen framework. It spans multiple fonts and textual granularities (words to full pages) and applies more than 25 augmentation strategies to emulate document degradations, positioning the dataset as a scalable, cost-effective alternative to manual annotation and a foundational resource for OCR training in this low-resource language with Nastaliq script challenges.

Significance. If the synthetic generation process produces data that transfers effectively to real Kashmiri documents, the dataset could provide a valuable starting point for training OCR models where annotated real data is scarce, supporting digitization efforts for Kashmiri textual heritage. The scale and augmentation count are strengths, but the lack of any reported validation metrics means the practical significance remains unassessed.

major comments (1)
  1. [Abstract] Abstract: The central claim that Koshur Pixel 'establishes a foundational resource for training OCR systems' is load-bearing on the assumption that the SynthOCR-Gen pipeline and >25 augmentations produce image-text pairs representative of real Nastaliq ligatures, contextual shaping, and degradations. No quantitative support is supplied—no real-document test set, no CER/WER numbers on held-out data, no ablation on augmentation realism, and no comparison to existing low-resource OCR baselines—preventing assessment of whether models trained on the dataset generalize.

Simulated Author's Rebuttal

1 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that the abstract's claim requires qualification in the absence of empirical validation metrics, and we will revise the manuscript to address this directly while preserving the core contribution as a synthetic dataset release.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that Koshur Pixel 'establishes a foundational resource for training OCR systems' is load-bearing on the assumption that the SynthOCR-Gen pipeline and >25 augmentations produce image-text pairs representative of real Nastaliq ligatures, contextual shaping, and degradations. No quantitative support is supplied—no real-document test set, no CER/WER numbers on held-out data, no ablation on augmentation realism, and no comparison to existing low-resource OCR baselines—preventing assessment of whether models trained on the dataset generalize.

    Authors: We acknowledge the validity of this observation. The manuscript introduces a synthetic dataset to mitigate the absence of annotated Kashmiri OCR data, but the strong phrasing in the abstract does overstate the immediate utility without supporting experiments. Because no large-scale, publicly available annotated real-world Kashmiri document corpus exists for benchmarking, we cannot supply CER/WER results, ablation studies on real data, or baseline comparisons at present. We will revise the abstract to state that Koshur Pixel 'provides a large-scale synthetic resource intended to support the development of OCR systems' rather than claiming it 'establishes a foundational resource.' We will also add a dedicated limitations subsection discussing the synthetic nature of the data, the lack of real-document validation, and the need for future collection of annotated real samples. These changes will make the scope of the contribution explicit. revision: yes

standing simulated objections not resolved
  • Provision of quantitative metrics (CER/WER, ablations, or real-document test sets) on actual Kashmiri documents, as no such annotated real-world resources are currently available for this low-resource language.

Circularity Check

0 steps flagged

No circularity: dataset generation paper with no derivations or fitted predictions

full rationale

The paper describes creation of a synthetic OCR dataset (Koshur Pixel) from the KS-PRET-5M corpus via the SynthOCR-Gen framework plus augmentations. No equations, parameter fitting, predictions, uniqueness theorems, or self-citations appear in the provided text. The central claim is simply that the generated image-text pairs exist and can serve as training data; this does not reduce to any input by construction. The work is self-contained as a resource release and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the unverified effectiveness of the SynthOCR-Gen framework and the representativeness of the KS-PRET-5M corpus for generating usable OCR training data; these are domain assumptions without independent evidence in the abstract.

axioms (2)
  • domain assumption The KS-PRET-5M corpus contains representative Kashmiri text suitable for generating OCR training data.
    Invoked directly in the generation of the 613,078 pairs described in the abstract.
  • domain assumption More than 25 augmentation strategies can emulate real-world document degradations sufficiently for training OCR models.
    Stated as part of the dataset construction without validation details.

pith-pipeline@v0.9.1-grok · 5699 in / 1474 out tokens · 28790 ms · 2026-06-26T09:17:36.511901+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

36 extracted references · 4 canonical work pages · 2 internal anchors

  1. [1]

    Yet, this transformative progress has not been uniformly distributed across the global linguistic landscape

    INTRODUCTION: THE DIGIT AL VOID AND LINGUISTIC INEQUITY The rapid and pervasive advancement of artificial intelligence (AI) and digital technologies has fundamentally reshaped human interaction with information, commerce, and culture. Yet, this transformative progress has not been uniformly distributed across the global linguistic landscape. A significant...

  2. [2]

    BACKGROUND AND RELA TED WORK: KASHMIRI NLP AND OCR EVOLUTION 2.1. The Historical Context of Kashmiri Digitization and NLP The journey of the Kashmiri language into the digital realm has been historically complex and fraught with significant technical hurdles. For several decades, the de facto standard for digital typesetting and printing of Kashmiri text ...

  3. [3]

    This critical technological breakthrough enabled the recovery and standardization of millions of words of previously inaccessible Kashmiri literature

    developed a robust and highly accurate InPage-to-Unicode converter, achieving an impressive 98.7% conversion accuracy. This critical technological breakthrough enabled the recovery and standardization of millions of words of previously inaccessible Kashmiri literature. The output of this conversion process formed the foundational KS-PRET-5M corpus [ 2], w...

  4. [4]

    Nastaliq, a term derived from the Persian words naskh (copying) and ta’liq (hanging), is renowned for its aesthetic beauty and fluid, cursive nature

    DEEP LINGUISTIC ANAL YSIS OF THE KASHMIRI NAST ALIQ SCRIPT Kashmiri is predominantly written in the Perso-Arabic script, specifically adopting the Nastaliq calligraphic style, which accounts for over 98.5% of its written material. Nastaliq, a term derived from the Persian words naskh (copying) and ta’liq (hanging), is renowned for its aesthetic beauty and...

  5. [5]

    Noise Filtering: The diacritics are often treated as extraneous visual noise, such as smudges or scanner artifacts, and are consequently filtered out or ignored during preprocessing, leading to a complete loss of phonological information

  6. [6]

    This results in a transcription that is orthographically incorrect and semantically altered, rendering the text linguistically meaningless

    Character Substitution: The model may incorrectly substitute a unique Kashmiri character with the closest visually similar Arabic or Urdu character it has been trained on. This results in a transcription that is orthographically incorrect and semantically altered, rendering the text linguistically meaningless

  7. [7]

    This can lead to fragmented character recognition or incorrect bounding box predictions, further degrading accuracy

    Segmentation F ailure: Due to the vertical stacking nature of many diacritics, OCR systems often struggle to correctly segment the base character from its diacritical mark. This can lead to fragmented character recognition or incorrect bounding box predictions, further degrading accuracy. A robust Kashmiri OCR system must therefore be explicitly trained o...

  8. [8]

    METHODOLOGY I: THE KS-PRET-5M TEXT CORPUS PIPELINE The efficacy and linguistic fidelity of any synthetic OCR dataset are fundamentally contingent upon the quality and representativeness of its underlying text corpus. For the Koshur Pixel dataset, the foundation is the KS-PRET-5M corpus [ 2], which stands as the most extensive and meticulously cleaned colle...

  9. [9]

    was assembled from diverse digitized sources, including historical archives, literary texts, news articles, and web-crawled content. Owing to substantial heterogeneity in encoding standards, orthographic conventions, and document quality, the raw corpus underwent an extensive eleven-stage normalization pipeline designed to maximize linguistic consistency ...

  10. [10]

    METHODOLOGY II: THE SYNTHOCR-GEN RENDERING ENGINE The second, and arguably most critical, phase in the construction of the Koshur Pixel dataset involves the high-fidelity rendering of the cleaned Unicode text from the KS-PRET-5M corpus into visual images. This process is executed by the SynthOCR-Gen pipeline [ 1], a specialized, fully client-side and brow...

  11. [11]

    METHODOLOGY III: ADV ANCED VISUAL AUGMENT A TION SUITE A critical challenge in the development of synthetic datasets for computer vision, particularly for OCR, is the ”synthetic-to-real” gap. This phenomenon refers to the performance degradation observed when a model trained exclusively on clean, artificially generated images is deployed in real-world sce...

  12. [12]

    DA T ASET ST A TISTICS, DIVERSITY, AND PURITY ANAL YSIS The Koshur Pixel dataset, generated through the meticulously designed pipeline described in Sections 4, 5, and 6, comprises a total of 613,078 high-fidelity image-text pairs. This section provides a comprehensive statistical and qualitative analysis of the dataset, highlighting its scale, linguistic ...

  13. [13]

    DISCUSSION: SOCIO-TECHNICAL IMP ACT, USE CASES, AND COMPUTE BARRIERS The creation and release of the Koshur Pixel dataset represent a significant milestone in the digital trajectory of the Kashmiri language. Beyond its immediate utility as a machine learning resource, this work has profound socio-technical implications, enabling new applications while als...

  14. [14]

    for kashmiri to disambiguate visually similar words based on context). 8.3. The Compute Constraint: A New Barrier While Koshur Pixel effectively dismantles the ”data barrier” for Kashmiri OCR, it simultaneously highlights a new, formidable challenge: the ”compute barrier. ” Training or even fine-tuning state-of-the-art deep learning models on a dataset of...

  15. [15]

    CONCLUSION AND FUTURE WORK In this paper, we have presented Koshur Pixel , a groundbreaking and meticulously constructed large-scale synthetic OCR dataset specifically tailored for the Kashmiri language. By generating 613,078 high-fidelity image-text pairs, we have successfully provided a critical and unprecedented resource that effectively breaks the lon...

  16. [16]

    Generative Handwriting Synthesis: A significant challenge remains in recognizing Kashmiri handwriting, which exhibits even greater variability than printed text. We plan to explore the development of advanced generative models, such as Diffusion models or Generative Adversarial Networks (GANs), to synthesize realistic Kashmiri handwriting based on Unicode...

  17. [17]

    End-to-End Document Understanding: While Koshur Pixel includes page-level renders, real-world historical documents often feature highly complex and non-standard layouts, including marginalia, stamps, watermarks, and multi-column poetry. Future work will focus on expanding the dataset to include more diverse and challenging document layouts, and subsequent...

  18. [18]

    We will investigate how models trained on the Perso-Arabic Koshur Pixel dataset can be adapted or transferred to recognize these other scripts

    Cross-Script and Cross-Lingual T ransfer Learning: Kashmiri is also written in the Devanagari script (primarily by Kashmiri Pandits) and historically in the Sharada script. We will investigate how models trained on the Perso-Arabic Koshur Pixel dataset can be adapted or transferred to recognize these other scripts. Furthermore, we will explore the potenti...

  19. [19]

    Synthocr-gen: A synthetic ocr dataset generator for low-resource languages- breaking the data barrier,

    H. N. Malik, K. M. Shafi, and T. A. Reshi, “Synthocr-gen: A synthetic ocr dataset generator for low-resource languages- breaking the data barrier,” arXiv preprint arXiv:2601.16113, 2026

  20. [20]

    Ks-pret-5m: A 5 million word, 12 million token kashmiri pretraining dataset,

    H. N. Malik and N. Nissar, “Ks-pret-5m: A 5 million word, 12 million token kashmiri pretraining dataset,” arXiv preprint arXiv:2604.11066, 2026

  21. [21]

    UNESCO, Atlas of the world’s languages in danger, https://en.unesco.org/languages- atlas, Accessed: 2026-06-19

  22. [22]

    wikipedia

    Wikipedia contributors, Inpage, https : / / en . wikipedia . org / wiki / InPage, Accessed: 2026-06-20, 2026

  23. [23]

    Recovering the lost generation: A robust inpage-to-unicode converter for kashmiri,

    H. N. Malik, “Recovering the lost generation: A robust inpage-to-unicode converter for kashmiri,” Journal of Digital Humanities and Linguistic Preservation , vol. 12, no. 4, pp. 45–58, 2024

  24. [24]

    Historical review of ocr research and development,

    S. Mori, C. Y. Suen, and K. Yamamoto, “Historical review of ocr research and development,” Proceedings of the IEEE , vol. 80, no. 7, pp. 1029–1058, 1992. doi: 10.1109/5.156468

  25. [25]

    Hidden markov model based optical character recognition in the presence of deterministic transformations,

    S.-L. Kuo and O. E. Agazzi, “Hidden markov model based optical character recognition in the presence of deterministic transformations,” Pattern Recognition , vol. 26, no. 12, pp. 1813–1826, 1993. doi: 10 . 1016 / 0031 - 3203(93)90178-Y

  26. [26]

    Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks , year =

    A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning , 2006, pp. 369–376. doi: 10.1145/1143844.1143891

  27. [28]

    Trocr: Transformer-based optical character recognition with pre-trained models,

    M. Li et al., “Trocr: Transformer-based optical character recognition with pre-trained models,” arXiv preprint arXiv:2109.10282, 2021. doi: 10. 48550/arXiv.2109.10282 arXiv: 2109.10282 [cs.CL]

  28. [29]

    Wang, Z., Codella, N., Chen, Y ., Zhou, L., Dai, X., Xiao, B., Yang, J., You, H., Chang, K., Chang, S., and Yuan, L

    G. Kim et al., “Ocr-free document understanding transformer,” arXiv preprint arXiv:2111.15664, 2021. doi: 10.48550/arXiv. 2111.15664 arXiv: 2111.15664 [cs.LG]

  29. [30]

    Synthetic data for text localisation in natural images,

    A. Gupta, A. Vedaldi, and A. Zisserman, “Synthetic data for text localisation in natural images,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , 2016

  30. [31]

    kashmirilanguage

    Kashmiri Language, Kashmiri language fonts and resources , https : / / www . kashmirilanguage . com/ , Accessed: 2026-06-22, 2026

  31. [32]

    Development of a unicode compliant kashmiri font: Issues 13 KOSHUR PIXEL: A LARGE-SCALE SYNTHETIC OCR DATASET FOR KASHMIRI and resolution,

    A. A. Lawey and N. Mehdi, “Development of a unicode compliant kashmiri font: Issues 13 KOSHUR PIXEL: A LARGE-SCALE SYNTHETIC OCR DATASET FOR KASHMIRI and resolution,” Interdisciplinary Journal of Linguistics, vol. 4, pp. 195–200, 2011. [Online]. A vailable: https : / / linguistics . uok . edu . in / Files / f6ec3740 - 422d - 4ac1 - 9f52 - ddfe2cffcb28 / J...

  32. [33]

    Paligemma: A versatile 3b vlm for transfer,

    L. Beyer et al., “Paligemma: A versatile 3b vlm for transfer,” arXiv preprint arXiv:2407.07726 ,

  33. [34]

    doi: 10.48550/arXiv.2407.07726 arXiv: 2407.07726 [cs.CV]

  34. [35]

    Indicbert: A monolingual and multilingual benchmark for indic languages,

    D. Kakwani et al., “Indicbert: A monolingual and multilingual benchmark for indic languages,” in Findings of the Association for Computational Linguistics: EMNLP , 2020

  35. [36]

    An overview of the tesseract ocr engine,

    R. Smith, “An overview of the tesseract ocr engine,” in Ninth International Conference on Document Analysis and Recognition (ICDAR) , 2007

  36. [37]

    Koshur diacritizer: A byte-level sequence-to-sequence model for kashmiri diacritic restoration,

    H. N. Malik, N. Nissar, and F. Iqbal, “Koshur diacritizer: A byte-level sequence-to-sequence model for kashmiri diacritic restoration,” arXiv preprint arXiv:2606.15883 , 2026. arXiv: 2606. 15883 [cs.CL] . [Online]. A vailable: https : / / arxiv.org/abs/2606.15883 A. VISUAL SAMPLES AND DA T ASET DIVERSITY This appendix provides a comprehensive visual overv...