
arxiv: 2604.04515 · v1 · submitted 2026-04-06 · 💻 cs.CL

CommonMorph: Participatory Morphological Documentation Platform

Pith reviewed 2026-05-10 20:03 UTC · model grok-4.3

classification 💻 cs.CL
keywords morphological documentation · participatory platform · low-resource languages · active learning · annotation tools · language preservation · UniMorph

The pith

CommonMorph is a platform that speeds morphological data collection for low-resource languages through expert rules, contributor input, and community validation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Morphological annotation demands linguistic expertise and resources that many languages lack. CommonMorph tackles this with a three-tier system in which experts define linguistic rules, contributors supply examples, and communities review outputs. It reduces effort via active learning for annotation suggestions and tools that import and adapt data from related languages. The design handles fusional, agglutinative, and root-and-pattern morphologies and produces outputs compatible with UniMorph and other NLP tools. Being open source, the platform offers a replicable way to document linguistic diversity.

Core claim

CommonMorph streamlines morphological data collection through a three-tiered approach: expert linguistic definition, contributor elicitation, and community validation. The platform aims to minimise manual work by incorporating active learning, annotation suggestions, and tools to import and adapt materials from related languages. It accommodates diverse morphological systems, including fusional, agglutinative, and root-and-pattern morphologies. Its open-source design and UniMorph-compatible outputs are intended to ensure accessibility and interoperability with NLP tools.
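As a rough illustration of what "UniMorph-compatible output" can look like, the sketch below serialises entries into the three-column, tab-separated layout used by UniMorph data files: lemma, inflected form, and a semicolon-separated feature bundle. The `Entry` class and both function names are hypothetical and chosen for this example; they are not taken from the CommonMorph code base.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Entry:
    """One verified paradigm cell (hypothetical structure for illustration)."""
    lemma: str
    form: str
    features: Tuple[str, ...]  # UniMorph-style tags, e.g. ("V", "IND", "PRS", "1", "SG")


def to_unimorph_line(entry: Entry) -> str:
    """Render one entry as 'lemma<TAB>form<TAB>TAG;TAG;...'."""
    return "\t".join([entry.lemma, entry.form, ";".join(entry.features)])


def export_unimorph(entries: Iterable[Entry]) -> str:
    """Join all entries into a newline-terminated, tab-separated text block."""
    return "".join(to_unimorph_line(e) + "\n" for e in entries)


entries = [
    Entry("hablar", "hablo", ("V", "IND", "PRS", "1", "SG")),
    Entry("hablar", "hablas", ("V", "IND", "PRS", "2", "SG")),
]
exported = export_unimorph(entries)
```

Keeping the feature bundle as a tuple until export time makes it easy to validate tags before they are joined into a single string.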

What carries the argument

The three-tiered participatory approach of expert linguistic definition, contributor elicitation, and community validation, augmented by active learning and cross-language material adaptation.
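To make the active-learning part of this loop concrete, here is a minimal sketch of least-confidence sampling, one common way to decide which suggested forms to show contributors next. It is a generic illustration, not the selection strategy the paper uses: the `Suggestion` class, the `confidence` field, and the example scores are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Suggestion:
    """A model-proposed form for one paradigm cell (hypothetical structure)."""
    lemma: str
    features: Tuple[str, ...]
    form: str          # form proposed by the inflection model
    confidence: float  # assumed model probability for that form, in [0, 1]


def select_for_elicitation(pool: List[Suggestion], batch_size: int) -> List[Suggestion]:
    """Least-confidence sampling: pick the cells the model is least sure about."""
    if batch_size < 0:
        raise ValueError("batch_size must be non-negative")
    return sorted(pool, key=lambda s: s.confidence)[:batch_size]


tags = ("V", "IND", "PRS", "1", "SG")
pool = [
    Suggestion("cantar", tags, "canto", 0.97),
    Suggestion("ir", tags, "io", 0.31),      # irregular verb, low confidence
    Suggestion("tener", tags, "teno", 0.55),  # irregular verb, medium confidence
    Suggestion("hablar", tags, "hablo", 0.96),
]
batch = select_for_elicitation(pool, batch_size=2)
```

In this toy pool the two irregular verbs are queued first, so contributor effort goes to the cells where a correction is most likely to be needed.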

Load-bearing premise

Community validation combined with active learning will generate accurate morphological annotations at scale with minimal ongoing expert oversight.

What would settle it

A controlled test on a language with existing expert annotations in which CommonMorph outputs show accuracy below 80 percent compared to the gold standard.
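Such a test could be scored with a simple exact-match comparison against the gold paradigm cells. The sketch below assumes both data sets are dictionaries keyed by (lemma, feature bundle); the function name, the data layout, and the toy Latin data are illustrative assumptions, and only the 80 percent threshold comes from the text above.

```python
from typing import Dict, Tuple

Cell = Tuple[str, str]  # (lemma, feature bundle), e.g. ("amare", "V;IND;PRS;1;SG")

ACCURACY_THRESHOLD = 0.80  # threshold named in the text above


def exact_match_accuracy(predicted: Dict[Cell, str], gold: Dict[Cell, str]) -> float:
    """Share of gold cells whose form is reproduced exactly; missing cells count as errors."""
    if not gold:
        raise ValueError("gold standard must not be empty")
    correct = sum(1 for cell, form in gold.items() if predicted.get(cell) == form)
    return correct / len(gold)


gold = {
    ("amare", "V;IND;PRS;1;SG"): "amo",
    ("amare", "V;IND;PRS;2;SG"): "amas",
    ("amare", "V;IND;PRS;3;SG"): "amat",
    ("amare", "V;IND;PRS;1;PL"): "amamus",
    ("amare", "V;IND;PRS;2;PL"): "amatis",
}
predicted = dict(gold)
predicted[("amare", "V;IND;PRS;2;PL")] = "amates"  # one deliberate error

accuracy = exact_match_accuracy(predicted, gold)
falls_below_threshold = accuracy < ACCURACY_THRESHOLD
```

Counting absent cells as errors keeps the score from being inflated when the platform output covers only part of the gold paradigm.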

Figures

Figures reproduced from arXiv: 2604.04515 by Aso Mahmudi, Eduard Hovy, Ekaterina Vylomova, Kemal Kurniawan, Rico Sennrich, Sina Ahmadi.

Figure 1: The CommonMorph platform workflow facilitates elaboration of morphological structures by a linguist and provides an interoperable ecosystem for contributors to validate and enrich labelled databases.
…cal analysis and generation rely on large volumes of annotated morphological data, resources that are rarely available for low-resource or endangered languages. The UniMorph project (Batsuren et al., 2022b)…
Figure 2: Linguists define (a) paradigm structures, …
Figure 3: Example of a Latin verb conjugation table
Figure 4: Screenshots from the speaker interface.
…correct wordform without needing specialist terminology. This adaptive design allows contributions from both experts and non-experts, combining detailed linguistic knowledge with broader community input. 4.6. Inflection Model in Active Learning. The elicitation process follows an active learning framework (Mahmudi et al., 2025), progressing in iterative cycles to m…
Original abstract

Collecting and annotating morphological data present significant challenges, requiring linguistic expertise, methodological rigour, and substantial resources. These barriers are particularly acute for low-resource languages and varieties. To accelerate this process, we introduce CommonMorph, a comprehensive platform that streamlines morphological data collection development through a three-tiered approach: expert linguistic definition, contributor elicitation, and community validation. The platform minimises manual work by incorporating active learning, annotation suggestions, and tools to import and adapt materials from related languages. It accommodates diverse morphological systems, including fusional, agglutinative, and root-and-pattern morphologies. Its open-source design and UniMorph-compatible outputs ensure accessibility and interoperability with NLP tools. Our platform is accessible at https://common-morph.com, offering a replicable model for preserving linguistic diversity through collaborative technology.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces CommonMorph, a three-tiered participatory platform for morphological data collection in low-resource languages. It consists of expert linguistic definition, contributor elicitation, and community validation layers, augmented by active learning, annotation suggestions, and import/adaptation tools from related languages. The system claims to accommodate fusional, agglutinative, and root-and-pattern morphologies, minimize manual effort, and produce UniMorph-compatible outputs while being open-source and accessible at https://common-morph.com.

Significance. If the described features function as intended, the platform could meaningfully reduce barriers to morphological documentation for endangered and low-resource languages, supporting both linguistic preservation efforts and downstream NLP applications through standardized, interoperable data. The emphasis on participatory design and compatibility with existing resources like UniMorph represents a practical contribution to collaborative language technology.

major comments (2)
  1. [Abstract] Abstract: The central claim that the platform 'minimises manual work' via active learning, annotation suggestions, and import tools rests entirely on architectural description without any user studies, time-on-task measurements, accuracy benchmarks against gold standards, or error-rate evaluations. This absence directly undermines the paper's assertion of streamlining collection and is load-bearing for the contribution.
  2. [Abstract] Abstract / System Description: The assumption that community validation combined with active learning will reliably yield accurate annotations at scale, without systematic errors or excessive expert oversight, is stated but not tested or bounded by any empirical evidence or failure-mode analysis in the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive comments. We address each major point below and propose targeted revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the platform 'minimises manual work' via active learning, annotation suggestions, and import tools rests entirely on architectural description without any user studies, time-on-task measurements, accuracy benchmarks against gold standards, or error-rate evaluations. This absence directly undermines the paper's assertion of streamlining collection and is load-bearing for the contribution.

    Authors: We agree that the current wording in the abstract presents the minimization of manual work as an achieved outcome rather than a design objective. The manuscript is a system-description paper focused on the platform architecture, features, and UniMorph compatibility; no empirical evaluations were conducted because the platform is newly deployed. We will revise the abstract to state that the platform is designed to minimize manual effort through these mechanisms, and we will add a dedicated 'Limitations and Future Work' section that explicitly notes the absence of user studies and benchmarks while outlining planned evaluations. revision: partial

  2. Referee: [Abstract] Abstract / System Description: The assumption that community validation combined with active learning will reliably yield accurate annotations at scale, without systematic errors or excessive expert oversight, is stated but not tested or bounded by any empirical evidence or failure-mode analysis in the manuscript.

    Authors: We acknowledge that the manuscript does not provide empirical bounds or failure-mode analysis for the community-validation and active-learning components. These are presented as intended mechanisms within the three-tier architecture. We will revise the relevant sections to frame them as design assumptions rather than proven outcomes, and we will expand the new 'Limitations and Future Work' section to include discussion of potential failure modes (e.g., annotation drift, expert oversight requirements) and planned monitoring once the platform accumulates usage data. revision: partial

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper is a system-description manuscript introducing a three-tier participatory platform for morphological data collection. It contains no equations, no predictive models, no fitted parameters, and no derivation chain that could reduce to its own inputs. Claims about minimizing manual work via active learning and cross-language import tools are architectural assertions, not results derived from prior fitted quantities or self-citations. No load-bearing step matches any of the enumerated circularity patterns; the work is self-contained as an engineering and design contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is a software platform description rather than a theoretical or empirical result, so the ledger contains only a single domain assumption about the viability of participatory collection.

axioms (1)
  • domain assumption: Participatory methods with active learning can effectively collect accurate morphological data for diverse language types with limited expert resources.
    This premise underpins the entire three-tier design and is not independently validated in the provided abstract.

pith-pipeline@v0.9.0 · 5449 in / 1121 out tokens · 28198 ms · 2026-05-10T20:03:28.407717+00:00 · methodology


Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 1 internal anchor

  1. [1]

    CommonMorph: Participatory Morphological Documentation Platform

    Introduction With over 1,500 languages at risk of extinction by 2100 (Bromham et al., 2022), scaling and improving the efficiency of language documentation and data collection is essential to prevent further irreversible losses. During the “International Decade of Indigenous Languages” (2022–2032) (United Nations, 2019), tools that assist linguists an...

  2. [2]

    Intuitive Interface: Developed in collaboration with both linguists and speakers, the platform is intuitive and easy to use, with high user satisfaction reported in surveys (§ 6)

  3. [3]

    These materials can be adapted from existing resources of related languages in the system

    Linguist-led Workflow: Linguists can define initial materials and create patterns to reduce the workload for speakers, improving the efficiency of the early annotation process (§ 4.4). These materials can be adapted from existing resources of related languages in the system

  4. [4]

    Built-in Validation: The platform’s interactive evaluation of suggestions allows users to iteratively refine the data (§ 4.5)

  5. [5]

    Compatibility: Outputs can be exported in standard formats such as UniMorph (Batsuren et al., 2022b), enabling integration with existing NLP tools

  6. [6]

    Free and Open-Source: CommonMorph is accessible to all and openly available for further development and adaptation.

  7. [7]

    Related Work Recent work in computational linguistics has explored alternative methods for collecting morphological data, ranging from gamified participation to automated annotation. Gamification has been applied to morphological analysis, education, and community engagement, showing potential for improving data coverage and speaker involvement (E...

  8. [8]

    Linguists play a central role in defining the scope and quality of morphological resources

    Design Principles The CommonMorph platform is designed to address the core research question: How can morphological data collection be streamlined and scaled for low-resource and endangered languages through collaboration between linguists and speakers? To achieve this, the platform adheres to principles that balance the needs of its primary contributors,...

  9. [9]

    Based on this setup, the system generates suggested inflected forms, which speakers review by correcting or confirming them

    System Design The CommonMorph platform, as depicted in Figure 1, begins with linguists defining initial linguistic materials. Based on this setup, the system generates suggested inflected forms, which speakers review by correcting or confirming them. As verified data accumulates, the system improves its suggestions through active learning, reducin...

  10. [10]

    3.1.3) covering the inflection classes of the language, and

    Basic wordlist or vocabulary (Bowern, 2015, Sec. 3.1.3) covering the inflection classes of the language, and

  11. [11]

    These combinations describe how lexical items vary according to grammatical categories such as person, number, tense, aspect, mood, or case

    Morphosyntactic feature combinations for each inflection class, typically organised in paradigms or conjugation tables. These combinations describe how lexical items vary according to grammatical categories such as person, number, tense, aspect, mood, or case. Figure 3 provides an example of such a resource. In practice, documentation often begins ...

  12. [12]

    Each structure is tied to a specific inflection class

    Paradigm Structures (Figure 2a): This core module defines the morphosyntactic structures that make up inflectional paradigms, such as tense, aspect, and mood. Each structure is tied to a specific inflection class. Optional inflectional patterns with placeholders allow the system to generate candidate forms before the machine learning models have suf-...

  13. [13]

    Glosses ensure clarity for both contributors during elicitation and researchers during analysis

    Lexicon (Figure 2b): Here, lemmas are entered with their assigned inflection class, stems, and glosses. Glosses ensure clarity for both contributors during elicitation and researchers during analysis

  14. [14]

    Simpler inflectional systems, such as English verbs, may not require this module

    Reusable Layers (Figure 2c): To avoid redundancy, this module allows defining reusable sets of features (e.g., agreement affixes) linked to paradigm structures. Simpler inflectional systems, such as English verbs, may not require this module

  15. [15]

    Morphophonological rules (Optional): This module encodes orthographic adjustments (e.g., Turkish vowel harmony) as replacement patterns, implemented as regular expressions to reduce the need for manual corrections

  16. [16]

    How do you tell more than one person to [LEMMA] right now?

    Question Design (Optional) (Figure 2d): This module enables linguists to design elicitation prompts for non-expert speakers. Prompts are written in a shared metalanguage (e.g., English or Spanish) familiar to both linguist and speaker, and structured as reusable templates with placeholders for lemmas. For example, the prompt “How do you tell more than o...

  17. [17]

    Case studies To demonstrate the platform’s ability to support languages with diverse morphological systems, we conducted a series of case studies covering a wide range of language families. The selected languages include Spanish and Latin (Italic), English and German (Germanic), Hawrami Kurdish, Central Kurdish, and Farsi (Iranic), Arabic (Semitic), Swa...

  18. [18]

    As discussed in Section 2, some previous studies on morphological data collection have provided rough estimates of manual effort

    Evaluation and Results It is difficult to establish a baseline for evaluating the efficiency of our platform because no existing system offers comparable functionalities. As discussed in Section 2, some previous studies on morphological data collection have provided rough estimates of manual effort. However, we consider such comparisons inappropriate as...

  19. [19]

    Conclusion and Future Work This study presents CommonMorph, a collaborative platform designed to document and expand morphological resources. Our evaluation demonstrates that the platform effectively supports collaborative data collection by integrating rule-based linguistic input with machine learning suggestions derived from active learning. Both ling...

  20. [20]

    Bibliographical References Sina Ahmadi and Aso Mahmudi. 2023. Revisiting and amending Central Kurdish data on UniMorph 4.0. In Proceedings of the 20th SIGMORPHON workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 38–48, Toronto, Canada. Association for Computational Linguistics. Jason Baldridge and Alexis Palmer. 2009...

  21. [21]

    machine: Rapid development of finite-state morphological grammars

    Linguist vs. machine: Rapid development of finite-state morphological grammars. In Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology, pages 162–170, Online. Association for Computational Linguistics. Claire Bowern. 2015. Linguistic fieldwork: A practical guide. Springer. Lindell Bromham, R...

  22. [22]

    Julio de Urquijo

    Global predictors of language endangerment and the future of linguistic diversity. Nature ecology & evolution, 6(2):163–173. Lyle Campbell. 2018. How many language families are there in the world? Anuario del Seminario de Filología Vasca "Julio de Urquijo", 52(1/2):133–152. Gülşen Eryiğit, Fatih Bektaş, Ubey Ali, and Bihter Dereli. 2023. Gamificatio...

  23. [23]

    In Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing, pages 36–44, Boulder, Colorado

    Evaluating automation strategies in language documentation. In Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing, pages 36–44, Boulder, Colorado. Association for Computational Linguistics. Alexis Mary Palmer. 2009. Semi-automated annotation and active learning for language documentation. Ph.D. thesis, T...

  24. [24]

    <STEM1>" and stem2 is

    Dia-lingle: A gamified interface for dialectal data collection. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 148–158, Vienna, Austria. Association for Computational Linguistics. John Sylak-Glassman. 2016. The composition and use of the universal morphological fea...
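Entries 12, 15 and 16 in the list above mention inflectional patterns with placeholders, morphophonological rules written as regular-expression replacements, and prompt templates with a lemma placeholder. The sketch below shows how these three pieces could fit together, using Turkish plural vowel harmony as the example. The `<STEM>` and `[LEMMA]` placeholder handling, the archiphoneme notation `A`, and all function names are assumptions made for illustration; they are not CommonMorph's actual rule syntax.

```python
import re

VOWELS = "aeıioöuü"
BACK_VOWELS = "aıou"
FRONT_VOWELS = "eiöü"

# Each rule is (compiled pattern, replacement). "A" is an assumed placeholder
# vowel that is resolved by the last real vowel of the stem.
HARMONY_RULES = [
    (re.compile(rf"([{BACK_VOWELS}][^{VOWELS}]*)lAr$"), r"\1lar"),
    (re.compile(rf"([{FRONT_VOWELS}][^{VOWELS}]*)lAr$"), r"\1ler"),
]


def apply_rules(form: str) -> str:
    """Apply every replacement rule in order to a candidate form."""
    for pattern, replacement in HARMONY_RULES:
        form = pattern.sub(replacement, form)
    return form


def generate_candidate(pattern: str, stem: str) -> str:
    """Fill the <STEM> placeholder, then resolve orthographic adjustments."""
    return apply_rules(pattern.replace("<STEM>", stem))


def fill_prompt(template: str, lemma: str) -> str:
    """Instantiate an elicitation prompt for one lemma."""
    return template.replace("[LEMMA]", lemma)


plural_pattern = "<STEM>lAr"
candidates = {stem: generate_candidate(plural_pattern, stem)
              for stem in ["kitap", "ev", "göz", "okul"]}

prompt = fill_prompt("How do you tell more than one person to [LEMMA] right now?", "run")
```

Expressing the adjustment as ordered regular-expression replacements keeps the rule declarative, so a linguist could add or reorder rules without touching the generation code.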