Robust Language Identification for Romansh Varieties

Charlotte Model; Jannis Vamvas; Sina Ahmadi

arxiv: 2603.15969 · v2 · submitted 2026-03-16 · 💻 cs.CL

Pith reviewed 2026-05-15 09:33 UTC · model grok-4.3

classification 💻 cs.CL

keywords language identification · Romansh · idioms · Rumantsch Grischun · support vector machine · benchmark dataset · text classification · low-resource languages

The pith

A support vector machine distinguishes the Romansh idioms and Rumantsch Grischun with 97 percent average in-domain accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a language identification system for the regional idioms of Romansh, a language spoken in parts of Switzerland that includes a standardized form called Rumantsch Grischun. The system relies on support vector machines and is evaluated on a new benchmark dataset from two domains. It reaches 97 percent average accuracy within those domains, opening the door to practical tools that respect the differences between idioms. This matters for building better natural language processing applications for a language with limited resources and internal diversity.

Core claim

We present a LID system for Romansh idioms based on an SVM approach. We evaluate our model on a newly curated benchmark across two domains and find that it reaches an average in-domain accuracy of 97 percent, enabling applications such as idiom-aware spell checking or machine translation. Our classifier is publicly available.
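The abstract does not say how the two per-domain scores are combined into a single "average in-domain accuracy". A minimal sketch of the two common readings follows. The counts and domain names are made up for illustration and are not figures from the paper.

```python
def accuracy(correct, total):
    """Fraction of correctly labelled examples."""
    return correct / total

# Hypothetical (correct, total) counts per domain; NOT results from the paper.
domain_counts = {"domain_1": (90, 100), "domain_2": (240, 300)}

# Reading 1: unweighted mean of per-domain accuracies (each domain counts equally).
per_domain = {d: accuracy(c, n) for d, (c, n) in domain_counts.items()}
macro_average = sum(per_domain.values()) / len(per_domain)

# Reading 2: pooled accuracy (each example counts equally).
pooled = accuracy(
    sum(c for c, _ in domain_counts.values()),
    sum(n for _, n in domain_counts.values()),
)

print(f"macro: {macro_average:.3f}, pooled: {pooled:.3f}")
```

The two readings diverge whenever the domains differ in size, which is one reason the referee's request for dataset sizes matters.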

What carries the argument

Support vector machine classifier trained to label text as one of the Romansh idioms or the supra-regional Rumantsch Grischun.
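The abstract names the model family but not the features or the solver. As a rough sketch only, a one-vs-rest linear SVM over character n-grams could look like the following. The n-gram features, the Pegasos-style sub-gradient training loop, the class name `OneVsRestLinearSVM`, the hyperparameters, and the toy strings (which are not Romansh) are all assumptions for illustration, not the authors' implementation.

```python
from collections import Counter

def char_ngram_features(text, n_min=1, n_max=3):
    """L2-normalised counts of character n-grams (an assumed feature set)."""
    padded = f" {text.lower()} "
    counts = Counter(
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    )
    norm = sum(v * v for v in counts.values()) ** 0.5
    return {gram: v / norm for gram, v in counts.items()}

def dot(weights, features):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(weights.get(gram, 0.0) * v for gram, v in features.items())

class OneVsRestLinearSVM:
    """One linear SVM per label, trained with Pegasos-style sub-gradient steps."""

    def __init__(self, reg=0.01, epochs=50):
        self.reg = reg
        self.epochs = epochs

    def fit(self, texts, labels):
        feats = [char_ngram_features(t) for t in texts]
        self.labels_ = sorted(set(labels))
        self.weights_ = {label: {} for label in self.labels_}
        step = 0
        for _ in range(self.epochs):
            for x, y in zip(feats, labels):
                step += 1
                lr = 1.0 / (self.reg * step)
                for label, w in self.weights_.items():
                    target = 1.0 if y == label else -1.0
                    violated = target * dot(w, x) < 1.0  # hinge loss is active
                    for gram in w:  # shrink weights (L2 regularisation)
                        w[gram] *= 1.0 - lr * self.reg
                    if violated:  # move weights towards the violated example
                        for gram, v in x.items():
                            w[gram] = w.get(gram, 0.0) + lr * target * v
        return self

    def predict(self, texts):
        """Return the label whose SVM gives the highest score for each text."""
        predictions = []
        for text in texts:
            x = char_ngram_features(text)
            predictions.append(
                max(self.labels_, key=lambda lab: dot(self.weights_[lab], x))
            )
        return predictions

# Toy placeholder strings, NOT real Romansh; the labels are hypothetical.
train_texts = ["aaba abab baab", "abba baba aabb", "xyyx yxyx xxyy", "yxxy xyxy yyxx"]
train_labels = ["variety_a", "variety_a", "variety_b", "variety_b"]
model = OneVsRestLinearSVM().fit(train_texts, train_labels)
print(model.predict(["baba abab", "xyxy yxyx"]))
```

In practice a library implementation would normally replace the hand-written training loop; it is spelled out here only to show the hinge-loss update.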

If this is right

Idiom-aware spell checking becomes feasible for Romansh text.
Machine translation systems can be adapted to handle specific Romansh varieties.
The public classifier supports integration into other natural language processing pipelines.
The approach addresses classification challenges arising from limited mutual intelligibility between idioms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The SVM method could be adapted for variety identification in other low-resource languages that exhibit internal diversity.
Collecting training data from more domains would help confirm whether performance holds beyond the two tested domains.
Wider use of such classifiers might improve digital tools available for preserving and promoting Romansh in online contexts.

Load-bearing premise

The newly curated benchmark dataset across two domains is representative of real-world Romansh text and the SVM model will maintain performance on unseen data from additional domains or time periods.

What would settle it

A substantial drop in accuracy when the model is applied to Romansh text from a third domain or collected at a later date would show the claimed robustness does not hold.

Original abstract

The Romansh language has several regional varieties, called idioms, which sometimes have limited mutual intelligibility. Despite this linguistic diversity, there has been a lack of documented efforts to build a language identification (LID) system that can distinguish between these idioms. Since Romansh LID should also be able to recognize Rumantsch Grischun, a supra-regional variety that combines elements of several idioms, this makes for a novel and interesting classification problem. In this paper, we present a LID system for Romansh idioms based on an SVM approach. We evaluate our model on a newly curated benchmark across two domains and find that it reaches an average in-domain accuracy of 97%, enabling applications such as idiom-aware spell checking or machine translation. Our classifier is publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper fills a gap with the first LID system for Romansh idioms plus Rumantsch Grischun and a new benchmark, but the 97% claim rests on in-domain results with almost no experimental details.

The paper's key point is that it builds the first language identification system for Romansh idioms together with Rumantsch Grischun, and evaluates it on a new benchmark where an SVM reaches 97% in-domain accuracy.

The authors do a good job identifying and addressing a clear gap. Romansh has several varieties with limited mutual intelligibility, and adding the standard variety creates a fresh classification task that had not been documented before. Releasing the model publicly is a concrete step that could help with practical tools like spell checkers or machine translation systems. The work is straightforward and focused on a real need in low-resource settings. For a language with little prior NLP attention, providing both the benchmark and the classifier is more useful than theoretical advances alone.

The main issues are with the evaluation details and the robustness claim. The abstract supplies almost no information on dataset size, class balance, feature selection, or the exact train-test procedure. There are no baselines shown and no error analysis. The results stay within the two domains, so there is no test of how the model performs when the domain shifts. This makes the title's "robust" framing hard to accept without more evidence. The stress-test note is fair: if the domains are not diverse enough, the high accuracy might not hold up in real use.

This paper is for people working on language technology for minority or dialectal languages. A reader who needs a practical starting point for Romansh processing would get value from the released resources. It deserves a serious referee because the motivation is solid and the contribution is new, even though the current presentation leaves too many questions about the data and experiments. I would send it to peer review with requests for more methodological transparency and some form of cross-domain evaluation.

Referee Report

3 major / 2 minor

Summary. The manuscript presents an SVM-based language identification system for distinguishing Romansh idioms, including the supra-regional Rumantsch Grischun variety. It reports evaluation on a newly curated benchmark spanning two domains, achieving an average in-domain accuracy of 97%, and releases the classifier publicly for applications such as spell-checking and machine translation.

Significance. If the performance claims hold under detailed scrutiny, the work would fill a documented gap in LID for low-resource language varieties with limited mutual intelligibility. The public release of the model supports reproducibility and downstream use in practical NLP tools. The empirical focus on a real linguistic diversity problem is a positive contribution to computational linguistics for under-resourced languages.

major comments (3)

[Abstract] The central performance claim of 97% average in-domain accuracy is stated without dataset sizes, class balance, feature details, cross-validation procedure, baseline comparisons, or error analysis. These elements are load-bearing for verifying the result and cannot be assessed from the given text.
[Evaluation] Only in-domain accuracy across the two domains is reported. No cross-domain (train on one domain, test on the other), temporal, or out-of-domain generalization experiments are described, which directly undermines the title's 'robust' framing and the assumption that performance will hold on unseen data.
[Methods or Data] The annotation protocol, source selection, and handling of Rumantsch Grischun overlap with the idioms are not specified. Without these, it is impossible to determine whether the benchmark avoids leakage or represents genuine idiom distinctions.
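The cross-domain experiment requested in the Evaluation comment can be sketched as a train-domain by test-domain accuracy matrix: diagonal cells are in-domain scores, off-diagonal cells are cross-domain scores. Everything below is an assumption for illustration: the character-overlap stand-in model, the function names, the domain names, and the contrived toy data (not Romansh), which is built so that one direction of transfer fails.

```python
from collections import Counter
from itertools import product

def train_overlap_classifier(examples):
    """Stand-in model: per-label character counts. Swap in the real classifier."""
    profiles = {}
    for text, label in examples:
        profiles.setdefault(label, Counter()).update(text.lower())
    return profiles

def predict(profiles, text):
    """Pick the label whose character profile overlaps most with the text."""
    chars = Counter(text.lower())

    def score(label):
        profile = profiles[label]
        return sum(chars[c] * profile[c] for c in chars) / sum(profile.values())

    return max(sorted(profiles), key=score)

def accuracy(profiles, examples):
    return sum(predict(profiles, t) == y for t, y in examples) / len(examples)

def domain_matrix(data):
    """data: {domain: {"train": [(text, label)], "test": [(text, label)]}}."""
    results = {}
    for source, target in product(data, repeat=2):
        model = train_overlap_classifier(data[source]["train"])
        results[(source, target)] = accuracy(model, data[target]["test"])
    return results

# Contrived placeholder data, NOT real Romansh and NOT the paper's benchmark.
data = {
    "domain_1": {
        "train": [("aaaa", "variety_a"), ("bbbb", "variety_b")],
        "test": [("aaab", "variety_a"), ("abbb", "variety_b")],
    },
    "domain_2": {
        "train": [("ccca", "variety_a"), ("dddb", "variety_b")],
        "test": [("cccb", "variety_a"), ("ddda", "variety_b")],
    },
}

results = domain_matrix(data)
for (source, target), acc in results.items():
    kind = "in-domain" if source == target else "cross-domain"
    print(f"train={source} test={target} ({kind}): {acc:.2f}")
```

A large gap between the diagonal and off-diagonal cells is the kind of evidence that would count against the "robust" framing; a small gap would support it.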

minor comments (2)

[Abstract] The abstract refers to 'two domains' without naming them or describing their characteristics (e.g., genre, time period, or collection method).
[Introduction] Clarify the exact set of idioms modeled and how Rumantsch Grischun is treated as a distinct class versus a mixture.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We will make the revisions outlined below to address the major comments and improve the clarity and robustness of our work.

Referee: [Abstract] The central performance claim of 97% average in-domain accuracy is stated without dataset sizes, class balance, feature details, cross-validation procedure, baseline comparisons, or error analysis. These elements are load-bearing for verifying the result and cannot be assessed from the given text.

Authors: We agree that these details are essential for verifying the performance claims. We will revise the abstract to incorporate the dataset sizes, class balance information, feature details, cross-validation procedure, baseline comparisons, and a high-level error analysis. This revision will make the abstract self-contained for the key result.

revision: yes

Referee: [Evaluation] Only in-domain accuracy across the two domains is reported. No cross-domain (train on one domain, test on the other), temporal, or out-of-domain generalization experiments are described, which directly undermines the title's 'robust' framing and the assumption that performance will hold on unseen data.

Authors: The referee correctly points out that the current evaluation is limited to in-domain settings within each of the two domains. While the high accuracy across different domains provides some evidence of robustness, we acknowledge that cross-domain experiments would strengthen the 'robust' claim in the title. We will add these experiments in the revised manuscript, including training on one domain and testing on the other, and discuss the results in terms of generalization to unseen data. We will also consider adjusting the title if the additional results warrant it.

revision: yes

Referee: [Methods or Data] The annotation protocol, source selection, and handling of Rumantsch Grischun overlap with the idioms are not specified. Without these, it is impossible to determine whether the benchmark avoids leakage or represents genuine idiom distinctions.

Authors: We agree that the data curation process requires more transparency. The revised manuscript will include a detailed description of the annotation protocol, the sources selected for each idiom, and how Rumantsch Grischun texts were chosen and processed to minimize overlap with idiom-specific features, ensuring the distinctions are genuine and that no data leakage occurred between train and test splits.

revision: yes
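The leakage concern raised in this exchange can be checked mechanically, at least for exact and near-exact duplicates. A minimal sketch, assuming a simple lowercase and whitespace normalisation; the function names and the placeholder strings are hypothetical and do not describe the authors' curation process.

```python
import re

def normalise(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())

def find_leaks(train_texts, test_texts):
    """Return test texts whose normalised form also occurs in the training set."""
    seen = {normalise(t) for t in train_texts}
    return [t for t in test_texts if normalise(t) in seen]

# Placeholder strings, NOT benchmark data.
train = ["Foo  bar baz", "lorem ipsum"]
test = ["foo bar baz ", "dolor sit amet"]
leaks = find_leaks(train, test)
print(leaks)
```

This only catches duplicated strings. Sentences split from the same source document across train and test would need a document-level split to rule out.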

Circularity Check

0 steps flagged

No circularity: empirical accuracy on held-out benchmark

The paper presents an SVM-based LID system and reports its average in-domain accuracy of 97% on a newly curated benchmark across two domains. No equations, derivations, fitted parameters renamed as predictions, or self-referential definitions appear in the text. The central result is a direct empirical measurement on held-out data rather than a reduction to inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes are used to justify the claims. The evaluation is self-contained against the provided benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard supervised classification assumptions and the representativeness of the new benchmark; no free parameters, invented entities, or non-standard axioms are introduced beyond the choice of SVM and the curated data.

axioms (1)

domain assumption: Training and test data are drawn from the same distribution within each domain.
Implicit in any in-domain accuracy claim for a supervised classifier.
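One common way to make this assumption hold by construction is a shuffled, label-stratified split within each domain. The sketch below is an illustration only: the paper's actual split procedure is not described in the abstract, and the function name, the 20 percent test fraction, the seed, and the placeholder examples are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(examples, test_fraction=0.2, seed=0):
    """Split (text, label) pairs so each label keeps roughly the same share."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # At least one test item per label; a label with a single example
        # therefore ends up in the test set only.
        n_test = max(1, round(len(items) * test_fraction))
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# Placeholder examples for one hypothetical domain, NOT benchmark data.
examples = [(f"text_{i}", "variety_a") for i in range(10)]
examples += [(f"other_{i}", "variety_b") for i in range(5)]
train, test = stratified_split(examples)
print(len(train), len(test))
```

Run once per domain, this keeps the label proportions of train and test aligned, which is the setting an in-domain accuracy figure implicitly describes.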

pith-pipeline@v0.9.0 · 5420 in / 1191 out tokens · 44961 ms · 2026-05-15T09:33:51.432342+00:00 · methodology
