pith. sign in

arxiv: 2606.24825 · v1 · pith:VT2OLLQSnew · submitted 2026-06-23 · 💻 cs.CL · cs.LG

L3Cube-MahaPOS: A Marathi Part-of-Speech Tagging Dataset and BERT Models

Pith reviewed 2026-06-25 23:43 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords MarathiPOS taggingdatasetNLPBERTUniversal Dependencieslow-resource languagesnews text
0
0 comments X

The pith

A new manually annotated dataset supplies 32,354 Marathi sentences for part-of-speech tagging and benchmarks models at 88.67 percent token accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper supplies L3Cube-MahaPOS to fill the gap in annotated resources for Marathi despite its large speaker population and distinctive linguistic traits such as rich morphology and code-mixing. The resource contains news sentences labeled manually under a 16-tag scheme aligned with Universal Dependencies, accompanied by a preprocessing pipeline that normalizes Unicode and tokenizes Devanagari script. Six model families are evaluated on the data, with the strongest results coming from a Marathi-specific transformer. The full release of the dataset, guidelines, and checkpoints directly enables downstream work in Marathi machine translation, parsing, and information extraction.

Core claim

We introduce L3Cube-MahaPOS, a gold-standard POS tagging dataset for Marathi comprising 32,354 manually annotated sentences drawn from news text. Annotation was performed entirely manually by a team of Marathi-proficient annotators following a 16-tag Universal Dependencies-aligned scheme. A structured preprocessing pipeline covering Unicode normalisation, Devanagari-aware tokenisation, and noise filtering ensures label consistency across all splits. The best system achieves 88.67% token-level accuracy and a macro-F1 of 81.67% over 15 evaluated tag classes.

What carries the argument

The L3Cube-MahaPOS dataset of 32,354 manually annotated sentences under a 16-tag Universal Dependencies scheme, together with the evaluation pipeline across HMM, CRF, BiLSTM, and transformer families.

If this is right

  • The released dataset and checkpoints can be used immediately to train or fine-tune POS taggers for Marathi text.
  • Downstream tasks such as syntactic parsing and machine translation for Marathi gain an immediate labeled resource.
  • The preprocessing steps provide a reusable template for creating similar corpora in other Devanagari-script languages.
  • The performance gap between traditional and transformer models supplies a baseline for future Marathi-specific model development.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same manual annotation approach could be applied to other under-resourced Indian languages that share morphological complexity and code-mixing patterns.
  • Extending the corpus beyond news text into domains such as literature or social media would test whether the reported accuracies hold in more varied text.
  • The 15-tag evaluation setup leaves one tag class aside, so future work could measure whether including the omitted class changes model rankings.

Load-bearing premise

Manual annotation by Marathi-proficient annotators following the 16-tag scheme produces consistent gold-standard labels with minimal errors or inconsistencies.

What would settle it

An independent review that finds annotation errors or tag inconsistencies in more than a small fraction of the sentences would show the labels are not gold-standard.

Figures

Figures reproduced from arXiv: 2606.24825 by Hariom Ingle, Ishwari Gondkar, Jidnyasa Harad, Raviraj Joshi, Ronit Ghode.

Figure 1
Figure 1. Figure 1: Four-stage preprocessing pipeline for L3Cube [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: visualises training loss, validation loss, and validation macro-F1 across ten epochs for L3Cube-MahaPOS-BERT. Training loss decreases monotonically, while validation loss exhibits a minimum at epoch 3 and gradually increases thereafter—a classic overfitting signature. Early stopping at patience 3 restores the epoch-3 check￾point. 5.4 Confusion Matrix [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Row-normalised confusion matrix for L3Cube-MahaPOS-BERT on the test set (16 tags). Darker diagonal [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: End-to-end architecture of L3Cube-MahaPOS-BERT. The model produces logits over 15 evaluated [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
read the original abstract

Part-of-Speech (POS) tagging is a foundational NLP task underpinning machine translation, information extraction, and syntactic parsing. Despite Marathi being spoken by over 83 million people and ranking among the top twenty most spoken languages worldwide, it remains severely under-resourced in annotated corpora and standardised evaluation benchmarks. Marathi presents unique challenges for computational modelling owing to its rich morphology, relatively free word order, lack of capitalisation conventions, and pervasive code-mixing with Hindi and English. We introduce L3Cube-MahaPOS, a gold-standard POS tagging dataset for Marathi comprising 32,354 manually annotated sentences drawn from news text. Annotation was performed entirely manually by a team of Marathi-proficient annotators following a 16-tag Universal Dependencies-aligned scheme. A structured preprocessing pipeline covering Unicode normalisation, Devanagari-aware tokenisation, and noise filtering ensures label consistency across all splits. We benchmark the dataset across six model families spanning HMM, CRF, BiLSTM, BiLSTM+CharCNN, MuRIL, and the Marathi-specific transformer MahaBERT-v2. The best system achieves 88.67\% token-level accuracy and a macro-F1 of 81.67% over 15 evaluated tag classes. We release the dataset, annotation guidelines, and trained model checkpoints to foster further research in Marathi NLP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces L3Cube-MahaPOS, a new manually annotated POS tagging dataset for Marathi consisting of 32,354 sentences from news text, using a 16-tag Universal Dependencies-aligned scheme. It describes a preprocessing pipeline and benchmarks six model families (HMM, CRF, BiLSTM variants, MuRIL, MahaBERT-v2), reporting a best token-level accuracy of 88.67% and macro-F1 of 81.67% over 15 tag classes, while releasing the dataset, guidelines, and model checkpoints.

Significance. If the annotation quality holds, the work provides a valuable new resource for an under-resourced language with complex morphology and code-mixing, enabling standardized evaluation and further Marathi NLP research. The release of data and models is a clear strength for reproducibility in low-resource settings.

major comments (1)
  1. [Abstract / Annotation description] Abstract and (presumed) Methods/Annotation section: The central claim that L3Cube-MahaPOS constitutes a 'gold-standard' dataset with reliable labels supporting the reported 88.67% accuracy and 81.67% macro-F1 rests on manual annotation by Marathi-proficient annotators, yet no inter-annotator agreement statistic (Cohen's kappa, Fleiss' kappa, or similar), disagreement resolution protocol, or post-annotation validation is reported. This is load-bearing for the dataset's utility as a benchmark.
minor comments (2)
  1. [Abstract / Results] The abstract mentions '15 evaluated tag classes' while using a 16-tag scheme; clarify the excluded class and any rationale in the results section.
  2. [Dataset description] Data splits (train/dev/test sizes and stratification) are referenced but not quantified in the provided abstract; include explicit numbers and any domain or tag distribution statistics.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive evaluation of L3Cube-MahaPOS as a valuable resource and for the detailed feedback. We address the single major comment below and commit to revisions that strengthen the manuscript's claims about annotation quality.

read point-by-point responses
  1. Referee: [Abstract / Annotation description] Abstract and (presumed) Methods/Annotation section: The central claim that L3Cube-MahaPOS constitutes a 'gold-standard' dataset with reliable labels supporting the reported 88.67% accuracy and 81.67% macro-F1 rests on manual annotation by Marathi-proficient annotators, yet no inter-annotator agreement statistic (Cohen's kappa, Fleiss' kappa, or similar), disagreement resolution protocol, or post-annotation validation is reported. This is load-bearing for the dataset's utility as a benchmark.

    Authors: We agree that the absence of inter-annotator agreement (IAA) metrics and annotation protocol details is a significant omission that weakens the 'gold-standard' claim. The original manuscript reports only that annotation was performed manually by Marathi-proficient annotators following a 16-tag UD-aligned scheme, without quantifying agreement or describing disagreement resolution. In the revised version we will add a new subsection under Methods detailing: (i) the number of annotators and their qualifications, (ii) the exact annotation workflow including how disagreements were resolved (e.g., majority vote or adjudicator review), (iii) IAA statistics (Cohen's kappa on a 10% overlap subset), and (iv) any post-annotation validation steps. These additions will be supported by the existing annotation guidelines that are already being released with the dataset. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical dataset release and benchmarking

full rationale

The paper introduces a new manually annotated POS dataset for Marathi and reports benchmark accuracies from standard models (HMM, CRF, BiLSTM, transformers). No derivations, predictions, or first-principles results are claimed. The annotation process is presented as an input (manual labeling by proficient annotators following UD scheme) rather than derived from model outputs. No self-citations, fitted parameters renamed as predictions, or ansatzes appear in the provided text. The work is self-contained as a data contribution with external model evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies on standard NLP practices for annotation and model evaluation from prior literature; no new free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.1-grok · 5792 in / 1153 out tokens · 29214 ms · 2026-06-25T23:43:00.756817+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

52 extracted references

  1. [1]

    Thorsten Brants. 2000. TnT -- a statistical part-of-speech tagger. In Sixth Applied Natural Language Processing Conference, pages 224--231, Seattle, Washington, USA

  2. [2]

    Eric Brill. 1992. A simple rule-based part of speech tagger. In Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992

  3. [3]

    Harsh Vijay Chaudhari, Anuja Dinesh Patil, Dhanashree Lavekar, Pranav Khairnar, and Raviraj Joshi. 2023. L3Cube-MahaSocialNER : A social media based M arathi named entity recognition dataset and BERT models. In Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation, pages 93--100

  4. [4]

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm \'a n, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440--8451

  5. [5]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT : Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages 4171--4186

  6. [6]

    Mona Diab, Kadri Hacioglu, and Daniel Jurafsky. 2004. Automatic tagging of A rabic text: From raw text to base phrase chunks. In Proceedings of HLT-NAACL 2004: Short Papers , pages 149--152, Boston, Massachusetts, USA

  7. [7]

    S Hochreiter. 1997. Long short-term memory. Neural Computation MIT-Press

  8. [8]

    Raviraj Joshi. 2022 a . L3cube-mahacorpus and mahabert: Marathi monolingual corpus, marathi bert language models, and resources. In Proceedings of the WILDRE-6 workshop within the 13th language resources and evaluation conference, pages 97--101

  9. [9]

    Raviraj Joshi. 2022 b . L3Cube-MahaNLP : M arathi natural language processing datasets, models, and library. arXiv preprint arXiv:2205.14728

  10. [10]

    Nidhi Kowtal and Raviraj Bhuminand Joshi. 2025. L3Cube-MahaEmotions : A M arathi emotion recognition dataset with synthetic annotations using CoTR prompting and large language models. In Proceedings of the 39th Pacific Asia Conference on Language, Information and Computation, pages 419--427

  11. [11]

    Taku Kudo, Kaoru Yamamoto, and Yuji Matsumoto. 2004. Applying conditional random fields to J apanese morphological analysis. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 230--237, Barcelona, Spain

  12. [12]

    Atharva Kulkarni, Meet Mandhane, Manali Likhitkar, Gayatri Kshirsagar, and Raviraj Joshi. 2021. L3Cube-MahaSent : A M arathi tweet-based sentiment analysis dataset. In Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, pages 213--220

  13. [13]

    John Lafferty, Andrew McCallum, and Fernando CN Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data

  14. [14]

    Gary Geunbae Lee, Jeongwon Cha, and Jong-Hyeok Lee. 2002. Syllable-pattern-based unknown-morpheme segmentation and estimation for hybrid part-of-speech tagging of K orean. Computational Linguistics, 28(1):53--70

  15. [15]

    Onkar Litake, Maithili Ravindra Sabane, Parth Sachin Patil, Aparna Abhijeet Ranade, and Raviraj Bhuminand Joshi. 2022. L3Cube-MahaNER : A M arathi named entity recognition dataset and BERT models. In Proceedings of the WILDRE-6 Workshop within the 13th Language Resources and Evaluation Conference , pages 29--34

  16. [16]

    Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional LSTM-CNN s- CRF . In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1064--1074

  17. [17]

    Christopher D. Manning. 2011. Part-of- S peech tagging from 97 \ In Computational Linguistics and Intelligent Text Processing, pages 171--189, Berlin, Heidelberg. Springer Berlin Heidelberg

  18. [18]

    Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz

    Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of E nglish: The P enn T reebank. Computational Linguistics, 19(2):313--330

  19. [19]

    Hwee Tou Ng and Jin Kiat Low. 2004. C hinese part-of-speech tagging: One-at-a-time or all-at-once? word-based or character-based? In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, pages 277--284

  20. [20]

    Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman

    Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Haji c , Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evalua...

  21. [21]

    Martha Palmer, Rajesh Bhatt, Bhuvana Narasimhan, Owen Rambow, Dipti Misra Sharma, and Fei Xia. 2009. Hindi syntax: Annotating dependency, lexical predicate-argument structure, and phrase structure. In The 7th International Conference on Natural Language Processing, pages 14--17

  22. [22]

    Aabha Pingle, Aditya Vyawahare, Isha Joshi, Rahul Tangsali, and Raviraj Bhuminand Joshi. 2023. L3Cube-MahaSent-MD : A multi-domain M arathi sentiment analysis dataset and transformer models. In Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation, pages 274--281

  23. [23]

    Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Conference on Empirical Methods in Natural Language Processing, pages 133--142

  24. [24]

    Smriti Singh, Kuhoo Gupta, Manish Shrivastava, and Pushpak Bhattacharyya. 2006. Morphological richness offsets resource demand -- experiences in constructing a POS tagger for H indi. In Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions , pages 779--786

  25. [25]

    Manning, and Yoram Singer

    Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Human Language Technology Conference of the North A merican Chapter of the Association for Computational Linguistics , pages 252--259

  26. [26]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30

  27. [27]

    Part-of- S peech Tagging from 97 \

    Manning, Christopher D. Part-of- S peech Tagging from 97 \. Computational Linguistics and Intelligent Text Processing. 2011

  28. [28]

    Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992 , year =

    A simple rule-based part of speech tagger , author =. Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, February 23-26, 1992 , year =

  29. [29]

    Conditional random fields: Probabilistic models for segmenting and labeling sequence data , author =

  30. [30]

    Neural Computation MIT-Press , year =

    Long short-term memory , author =. Neural Computation MIT-Press , year =

  31. [31]

    End-to-end sequence labeling via bi-directional

    Ma, Xuezhe and Hovy, Eduard , booktitle =. End-to-end sequence labeling via bi-directional

  32. [32]

    Advances in Neural Information Processing Systems , volume =

    Attention is all you need , author =. Advances in Neural Information Processing Systems , volume =

  33. [33]

    Devlin, Jacob and Chang, Ming-Wei and Lee, Kenton and Toutanova, Kristina , booktitle =

  34. [34]

    Kulkarni, Atharva and Mandhane, Meet and Likhitkar, Manali and Kshirsagar, Gayatri and Joshi, Raviraj , booktitle =

  35. [35]

    The 7th International Conference on Natural Language Processing , pages =

    Hindi syntax: Annotating dependency, lexical predicate-argument structure, and phrase structure , author =. The 7th International Conference on Natural Language Processing , pages =

  36. [36]

    Joshi, Raviraj , journal =

  37. [37]

    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages =

    Unsupervised cross-lingual representation learning at scale , author =. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages =

  38. [38]

    Litake, Onkar and Sabane, Maithili Ravindra and Patil, Parth Sachin and Ranade, Aparna Abhijeet and Joshi, Raviraj Bhuminand , booktitle =

  39. [39]

    Pingle, Aabha and Vyawahare, Aditya and Joshi, Isha and Tangsali, Rahul and Joshi, Raviraj Bhuminand , booktitle =

  40. [40]

    Chaudhari, Harsh Vijay and Patil, Anuja Dinesh and Lavekar, Dhanashree and Khairnar, Pranav and Joshi, Raviraj , booktitle =

  41. [41]

    Kowtal, Nidhi and Joshi, Raviraj Bhuminand , booktitle =

  42. [42]

    Conference on Empirical Methods in Natural Language Processing , pages =

    A Maximum Entropy Model for Part-Of-Speech Tagging , author =. Conference on Empirical Methods in Natural Language Processing , pages =

  43. [43]

    and Santorini, Beatrice and Marcinkiewicz, Mary Ann , journal =

    Marcus, Mitchell P. and Santorini, Beatrice and Marcinkiewicz, Mary Ann , journal =. Building a Large Annotated Corpus of

  44. [44]

    Brants, Thorsten , booktitle =

  45. [45]

    Proceedings of the 2003 Human Language Technology Conference of the North

    Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network , author =. Proceedings of the 2003 Human Language Technology Conference of the North

  46. [46]

    Automatic Tagging of

    Diab, Mona and Hacioglu, Kadri and Jurafsky, Daniel , booktitle =. Automatic Tagging of

  47. [47]

    Ng, Hwee Tou and Low, Jin Kiat , booktitle =

  48. [48]

    Applying Conditional Random Fields to

    Kudo, Taku and Yamamoto, Kaoru and Matsumoto, Yuji , booktitle =. Applying Conditional Random Fields to

  49. [49]

    Syllable-Pattern-Based Unknown-Morpheme Segmentation and Estimation for Hybrid Part-of-Speech Tagging of

    Lee, Gary Geunbae and Cha, Jeongwon and Lee, Jong-Hyeok , journal =. Syllable-Pattern-Based Unknown-Morpheme Segmentation and Estimation for Hybrid Part-of-Speech Tagging of

  50. [50]

    Morphological Richness Offsets Resource Demand -- Experiences in Constructing a

    Singh, Smriti and Gupta, Kuhoo and Shrivastava, Manish and Bhattacharyya, Pushpak , booktitle =. Morphological Richness Offsets Resource Demand -- Experiences in Constructing a

  51. [51]

    Proceedings of the Tenth International Conference on Language Resources and Evaluation (

    Universal Dependencies v1: A Multilingual Treebank Collection , author =. Proceedings of the Tenth International Conference on Language Resources and Evaluation (

  52. [52]

    Proceedings of the WILDRE-6 workshop within the 13th language resources and evaluation conference , pages=

    L3cube-mahacorpus and mahabert: Marathi monolingual corpus, marathi bert language models, and resources , author=. Proceedings of the WILDRE-6 workshop within the 13th language resources and evaluation conference , pages=