Robust Language Identification for Romansh Varieties
Pith reviewed 2026-05-15 09:33 UTC · model grok-4.3
The pith
A support vector machine distinguishes Romansh idioms with 97 percent average in-domain accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present a LID system for Romansh idioms based on an SVM approach. We evaluate our model on a newly curated benchmark across two domains and find that it reaches an average in-domain accuracy of 97 percent, enabling applications such as idiom-aware spell checking or machine translation. Our classifier is publicly available.
What carries the argument
Support vector machine classifier trained to label text as one of the Romansh idioms or the supra-regional Rumantsch Grischun.
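The abstract does not say which features or which SVM variant the authors use, so the following is only a minimal sketch of the general technique: a linear SVM, trained one-vs-rest, over character n-gram features, written in plain Python. The character n-gram features, the Pegasos-style training loop, the hyperparameters, the function names, and the placeholder labels and strings (synthetic, not real Romansh) are all assumptions made for illustration.

```python
import random
from collections import Counter


def char_ngrams(text, n_min=1, n_max=3):
    """Return L2-normalised character n-gram counts of a space-padded string."""
    padded = f" {text.lower()} "
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    norm = sum(v * v for v in counts.values()) ** 0.5
    return {gram: v / norm for gram, v in counts.items()}


def dot(w, x):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(gram, 0.0) * v for gram, v in x.items())


def train_binary_svm(xs, ys, lam=0.01, epochs=50, seed=0):
    """Pegasos-style subgradient descent on the hinge loss (no bias term).

    xs are feature dicts, ys are +1 / -1. All hyperparameters are assumed.
    """
    rng = random.Random(seed)
    w, t = {}, 0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)
            margin = ys[i] * dot(w, xs[i])  # margin under the current weights
            shrink = 1.0 - eta * lam        # L2 regularisation step
            for gram in w:
                w[gram] *= shrink
            if margin < 1.0:                # hinge loss is active: move toward x
                for gram, v in xs[i].items():
                    w[gram] = w.get(gram, 0.0) + eta * ys[i] * v
    return w


def train_one_vs_rest(texts, labels, **kwargs):
    """Train one binary SVM per label; returns {label: weight dict}."""
    xs = [char_ngrams(t) for t in texts]
    return {
        c: train_binary_svm(xs, [1 if y == c else -1 for y in labels], **kwargs)
        for c in sorted(set(labels))
    }


def predict(model, text):
    """Pick the label whose SVM gives the highest score."""
    x = char_ngrams(text)
    return max(model, key=lambda c: dot(model[c], x))


# Synthetic placeholder strings and labels, NOT real Romansh data.
texts = ["xaxa xaxu", "xuxa xaxa", "zozo zozi", "zizo zozo", "kiki kiku", "kuki kiki"]
labels = ["idiom_a", "idiom_a", "idiom_b", "idiom_b",
          "rumantsch_grischun", "rumantsch_grischun"]
model = train_one_vs_rest(texts, labels)
print(predict(model, "xaxa xuxa"))
```

Treating Rumantsch Grischun as one more label alongside the idioms mirrors the description above. How the paper's classifier handles the overlap between Rumantsch Grischun and the idioms is not stated in the abstract.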
If this is right
- Idiom-aware spell checking becomes feasible for Romansh text.
- Machine translation systems can be adapted to handle specific Romansh varieties.
- The public classifier supports integration into other natural language processing pipelines.
- The approach addresses classification challenges arising from limited mutual intelligibility between idioms.
Where Pith is reading between the lines
- The SVM method could be adapted for variety identification in other low-resource languages that exhibit internal diversity.
- Collecting training data from more domains would help confirm whether performance holds beyond the two tested domains.
- Wider use of such classifiers might improve digital tools available for preserving and promoting Romansh in online contexts.
Load-bearing premise
The newly curated two-domain benchmark is representative of real-world Romansh text, and the SVM model will maintain its performance on unseen data from additional domains or time periods.
What would settle it
A substantial drop in accuracy when the model is applied to Romansh text from a third domain or collected at a later date would show the claimed robustness does not hold.
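The test described above can be sketched as a train-domain by test-domain accuracy matrix: the diagonal holds in-domain accuracy, and the off-diagonal entries show how much is lost on a different domain. This is a hypothetical harness, not the paper's evaluation code. The function names, the layout of the `splits` dictionary, and the token-vote stand-in classifier (used only so the example runs without an SVM library) are assumptions, and the data is synthetic.

```python
from collections import Counter, defaultdict


def accuracy(predict_fn, examples):
    """Fraction of (text, label) pairs that predict_fn labels correctly."""
    correct = sum(predict_fn(text) == label for text, label in examples)
    return correct / len(examples)


def domain_matrix(train_fn, splits):
    """Train on each domain, test on every domain.

    splits: {domain: {"train": [(text, label), ...], "test": [...]}}
    Returns {(train_domain, test_domain): accuracy}.
    """
    matrix = {}
    for src, src_split in splits.items():
        predict_fn = train_fn(src_split["train"])
        for tgt, tgt_split in splits.items():
            matrix[(src, tgt)] = accuracy(predict_fn, tgt_split["test"])
    return matrix


def mean_in_domain(matrix):
    """Average of the diagonal entries (train domain == test domain)."""
    diag = [acc for (src, tgt), acc in matrix.items() if src == tgt]
    return sum(diag) / len(diag)


def train_token_vote(examples):
    """Stand-in classifier: each known token votes for the labels it was seen with."""
    token_labels = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        for tok in text.split():
            token_labels[tok][label] += 1
    fallback = label_counts.most_common(1)[0][0]  # most frequent training label

    def predict_fn(text):
        votes = Counter()
        for tok in text.split():
            votes.update(token_labels.get(tok, {}))
        return votes.most_common(1)[0][0] if votes else fallback

    return predict_fn


# Synthetic two-domain toy data; the domains share no vocabulary on purpose.
splits = {
    "d1": {"train": [("x1 x2", "a"), ("x3", "a"), ("y1 y2", "b")],
           "test": [("x1", "a"), ("y2", "b")]},
    "d2": {"train": [("p1 p2", "a"), ("p3", "a"), ("q1 q2", "b")],
           "test": [("p2", "a"), ("q1", "b")]},
}
matrix = domain_matrix(train_token_vote, splits)
for pair, acc in sorted(matrix.items()):
    print(pair, acc)
print("mean in-domain:", mean_in_domain(matrix))
```

Any classifier can be plugged in through `train_fn`, as long as it returns a function from text to label. Adding a third domain or a later collection date only requires another entry in `splits`.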
Original abstract
The Romansh language has several regional varieties, called idioms, which sometimes have limited mutual intelligibility. Despite this linguistic diversity, there has been a lack of documented efforts to build a language identification (LID) system that can distinguish between these idioms. Since Romansh LID should also be able to recognize Rumantsch Grischun, a supra-regional variety that combines elements of several idioms, this makes for a novel and interesting classification problem. In this paper, we present a LID system for Romansh idioms based on an SVM approach. We evaluate our model on a newly curated benchmark across two domains and find that it reaches an average in-domain accuracy of 97%, enabling applications such as idiom-aware spell checking or machine translation. Our classifier is publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an SVM-based language identification system for distinguishing Romansh idioms, including the supra-regional Rumantsch Grischun variety. It reports evaluation on a newly curated benchmark spanning two domains, achieving an average in-domain accuracy of 97%, and releases the classifier publicly for applications such as spell-checking and machine translation.
Significance. If the performance claims hold under detailed scrutiny, the work would fill a documented gap in LID for low-resource language varieties with limited mutual intelligibility. The public release of the model supports reproducibility and downstream use in practical NLP tools. The empirical focus on a real linguistic diversity problem is a positive contribution to computational linguistics for under-resourced languages.
major comments (3)
- [Abstract] Abstract: The central performance claim of 97% average in-domain accuracy is stated without dataset sizes, class balance, feature details, cross-validation procedure, baseline comparisons, or error analysis. These elements are load-bearing for verifying the result and cannot be assessed from the given text.
- [Evaluation] Evaluation: Only in-domain accuracy across the two domains is reported. No cross-domain (train on one domain, test on the other), temporal, or out-of-domain generalization experiments are described, which directly undermines the title's 'robust' framing and the assumption that performance will hold on unseen data.
- [Methods] Methods or Data section: The annotation protocol, source selection, and handling of Rumantsch Grischun overlap with the idioms are not specified. Without these, it is impossible to determine whether the benchmark avoids leakage or represents genuine idiom distinctions.
minor comments (2)
- [Abstract] The abstract refers to 'two domains' without naming them or describing their characteristics (e.g., genre, time period, or collection method).
- [Introduction] Clarify the exact set of idioms modeled and how Rumantsch Grischun is treated as a distinct class versus a mixture.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive suggestions. We will make the revisions outlined below to address the major comments and improve the clarity and robustness of our work.
Point-by-point responses
- Referee: [Abstract] Abstract: The central performance claim of 97% average in-domain accuracy is stated without dataset sizes, class balance, feature details, cross-validation procedure, baseline comparisons, or error analysis. These elements are load-bearing for verifying the result and cannot be assessed from the given text.
  Authors: We agree that these details are essential for verifying the performance claims. We will revise the abstract to incorporate the dataset sizes, class balance information, feature details, cross-validation procedure, baseline comparisons, and a high-level error analysis. This revision will make the abstract self-contained for the key result.
  Revision: yes
- Referee: [Evaluation] Evaluation: Only in-domain accuracy across the two domains is reported. No cross-domain (train on one domain, test on the other), temporal, or out-of-domain generalization experiments are described, which directly undermines the title's 'robust' framing and the assumption that performance will hold on unseen data.
  Authors: The referee correctly points out that the current evaluation is limited to in-domain settings within each of the two domains. While the high accuracy across different domains provides some evidence of robustness, we acknowledge that cross-domain experiments would strengthen the 'robust' claim in the title. We will add these experiments in the revised manuscript, including training on one domain and testing on the other, and discuss the results in terms of generalization to unseen data. We will also consider adjusting the title if the additional results warrant it.
  Revision: yes
- Referee: [Methods] Methods or Data section: The annotation protocol, source selection, and handling of Rumantsch Grischun overlap with the idioms are not specified. Without these, it is impossible to determine whether the benchmark avoids leakage or represents genuine idiom distinctions.
  Authors: We agree that the data curation process requires more transparency. The revised manuscript will include a detailed description of the annotation protocol, the sources selected for each idiom, and how Rumantsch Grischun texts were chosen and processed to minimize overlap with idiom-specific features, ensuring the distinctions are genuine and that no data leakage occurred between train and test splits.
  Revision: yes
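One simple way to check the leakage concern raised in the last exchange is to look for test texts that also appear in the training split after light normalisation. The sketch below is an assumption about how such a check could look, not a description of the authors' curation process. The function names and the normalisation steps (case folding, Unicode NFC, whitespace collapsing) are illustrative choices.

```python
import unicodedata


def normalize(text):
    """Case-fold, apply Unicode NFC, and collapse runs of whitespace."""
    folded = unicodedata.normalize("NFC", text.casefold())
    return " ".join(folded.split())


def find_leaks(train_texts, test_texts):
    """Return the test texts whose normalised form also occurs in the training set."""
    seen = {normalize(t) for t in train_texts}
    return [t for t in test_texts if normalize(t) in seen]


# Synthetic example strings, not benchmark data.
train = ["Hello  World", "another training sentence"]
test = ["hello world", "an unseen sentence"]
print(find_leaks(train, test))
```

Exact matching after normalisation catches only verbatim duplicates. Near-duplicates, or sentences drawn from the same source document, would need a stricter check such as splitting by source.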
Circularity Check
No circularity: empirical accuracy on held-out benchmark
Full rationale
The paper presents an SVM-based LID system and reports its average in-domain accuracy of 97% on a newly curated benchmark across two domains. No equations, derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the text. The central result is a direct empirical measurement on held-out data rather than a reduction to inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are used to justify the claims. The evaluation is self-contained against the provided benchmark.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Training and test data are drawn from the same distribution within each domain.
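The assumption above is usually met by splitting one domain's data at random, so that train and test come from the same pool. A minimal sketch of a label-stratified random split follows. The abstract does not describe the paper's splitting procedure, so the function name, the 20 percent test fraction, and the fixed seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(examples, test_fraction=0.2, seed=0):
    """Split (text, label) pairs so each label keeps a similar share in train and test.

    Every label contributes at least one test example, so a label with a
    single example ends up in the test set only.
    """
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_test = max(1, round(len(items) * test_fraction))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test


# Synthetic examples with an imbalanced label distribution.
examples = [(f"a-{i}", "idiom_a") for i in range(10)]
examples += [(f"b-{i}", "idiom_b") for i in range(5)]
train, test = stratified_split(examples)
print(len(train), len(test))
```

A split like this supports only in-domain claims. It says nothing about text from another domain or a later time period, which is exactly the gap the referee report points to.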