"Chi nas dal soch el sent de legn" -- Auditing Text Corpora for Lombard
Pith reviewed 2026-06-28 01:32 UTC · model grok-4.3
The pith
Web-scraped data for Lombard is mostly misidentified noise and skewed toward Western varieties.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our analysis reveals that the perceived abundance of web-scraped data is an illusion, with massive datasets plagued by severe language misidentification, boilerplate text, and non-linguistic noise. Furthermore, we analyze the orthographic composition of the valid Lombard portions across web-scraped datasets, curated corpora, and benchmarks. Our findings show conflicting orthographical systems and severe representational bias across all corpora: high-quality data is heavily skewed towards Western Lombard varieties, with Eastern ones left on the margins. This underscores the need for variety-aware, community-driven data curation rather than purely quantity-driven scraping.
What carries the argument
Manual audit that checks language validity, identifies orthographic systems, and classifies regional varieties within the corpora.
If this is right
- Most existing web-scraped Lombard datasets cannot be used directly for training or evaluation because the majority of their content is not Lombard.
- Even the valid Lombard text in current resources follows conflicting spelling conventions that would interfere with model consistency.
- High-quality Lombard data currently over-represents Western varieties and under-represents Eastern varieties.
- Purely quantity-driven scraping will continue to reproduce the same misidentification and bias problems.
- Community involvement in curation is necessary to achieve better variety coverage.
Where Pith is reading between the lines
- The same misidentification and variety-bias pattern is likely to appear in scraped data for other under-resourced language continua.
- NLP data pipelines that prioritize size over verification may systematically exclude minority dialects across many languages.
- Future benchmarks for Lombard should include explicit checks for orthographic consistency and geographic balance.
Load-bearing premise
The manual audit can correctly separate real Lombard text from noise and assign orthographies and regional labels without missing sources or introducing consistent bias.
What would settle it
An independent large-scale audit that finds the majority of the same web-scraped material is correctly identified Lombard with balanced representation of Eastern and Western varieties would falsify the central claims.
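Whether an audited sample supports a "majority" claim in either direction depends on the sample size as well as the observed share. The sketch below computes a Wilson score interval for the proportion of valid Lombard lines in a sample and checks whether the whole interval sits above 50%. The function names and the example counts are illustrative assumptions, not figures from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


def majority_is_valid(valid_lines: int, n: int) -> bool:
    """True only if the entire interval lies above 0.5."""
    low, _ = wilson_interval(valid_lines, n)
    return low > 0.5


# Hypothetical counts, for illustration only (not results from the paper).
print(wilson_interval(410, 800))
print(majority_is_valid(410, 800))
```

With a point estimate just above one half, the interval still straddles 0.5, so a replication would need either a larger sample or a clearer margin before it could be said to overturn the original finding.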
Several of the world's languages are still under-resourced in terms of Natural Language Processing (NLP) tools. This is mostly due to the lack of high-quality datasets to train, develop, and evaluate systems and models for several tasks, such as Machine Translation (MT). We conduct a manual audit of the parallel and monolingual corpora available for Lombard, an under-resourced language continuum from Italy. Our analysis reveals that the perceived abundance of web-scraped data is an illusion, with massive datasets plagued by severe language misidentification, boilerplate text, and non-linguistic noise. Furthermore, we analyze the orthographic composition of the valid Lombard portions across web-scraped datasets, curated corpora, and benchmarks. Our findings show conflicting orthographical systems and severe representational bias across all corpora: high-quality data is heavily skewed towards Western Lombard varieties, with Eastern ones left on the margins. This underscores the need for variety-aware, community-driven data curation rather than purely quantity-driven scraping.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a manual audit of parallel and monolingual corpora for Lombard (an under-resourced language continuum), finding that web-scraped datasets are dominated by language misidentification, boilerplate, and non-linguistic noise. It further reports conflicting orthographic systems in the valid Lombard portions and severe representational bias, with high-quality data skewed toward Western varieties while Eastern ones are underrepresented. The work concludes that variety-aware, community-driven curation is needed over quantity-driven scraping.
Significance. If the audit methodology proves robust, the results would usefully document concrete data-quality failures in web-scraped resources for a low-resource language and quantify orthographic and dialectal skews across corpus types. Such evidence could directly inform corpus-construction practices and benchmark design in multilingual NLP.
major comments (2)
- [Methods] Methods / Audit procedure: No inter-annotator agreement, annotation guidelines, sampling procedure for large scraped collections, or auditor background is reported. Because the central claims (misidentification rates, orthographic composition, and Western/Eastern bias) rest entirely on the reliability of this manual classification, the absence of validation metrics leaves the quantitative findings unsupported.
- [Results] Results / Quantitative findings: The abstract states that web-scraped data are “plagued by severe” noise and that high-quality data are “heavily skewed,” yet no exact percentages, sample sizes, or tables of misidentification rates per corpus are referenced. Without these figures it is impossible to judge whether the reported problems are load-bearing or marginal.
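The inter-annotator agreement requested in the first comment is commonly reported as Cohen's kappa for two annotators. The following is a minimal sketch of that statistic, assuming two annotators label the same lines with categorical tags; the label names are hypothetical and the paper does not state which agreement measure, if any, it would use.

```python
from collections import Counter


def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Share of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / n**2
    if expected == 1.0:
        # Degenerate case (both used a single identical label); treated here
        # as perfect agreement, which is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels, for illustration only.
annotator_1 = ["lombard", "lombard", "noise", "noise"]
annotator_2 = ["lombard", "noise", "noise", "noise"]
print(cohen_kappa(annotator_1, annotator_2))  # 0.5
```

Kappa corrects raw agreement for chance, which matters when one category (for example, noise) dominates the sample.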
minor comments (1)
- [Title] The title is given only in Lombard; an English gloss or subtitle would improve accessibility for the broader NLP audience.
Simulated Author's Rebuttal
We thank the referee for these targeted comments on methodology and results presentation. Both points identify areas where the manuscript can be strengthened with additional transparency, and we will revise accordingly.
- Referee: [Methods] Methods / Audit procedure: No inter-annotator agreement, annotation guidelines, sampling procedure for large scraped collections, or auditor background is reported. Because the central claims (misidentification rates, orthographic composition, and Western/Eastern bias) rest entirely on the reliability of this manual classification, the absence of validation metrics leaves the quantitative findings unsupported.
  Authors: We agree the audit procedure requires fuller documentation. The manual classification was performed by the first author (a native speaker of Western Lombard with formal training in Romance linguistics and prior experience annotating dialectal data). We will add a new subsection (likely 2.2) that (a) reproduces the annotation guidelines used for language identification, noise detection, and variety labeling, (b) describes the sampling procedure (stratified random samples of 500–1000 lines per corpus, with explicit handling of very large scraped collections), and (c) states the auditor’s background. Because only one annotator was involved, inter-annotator agreement statistics are not applicable; we will explicitly note this limitation and outline how future work could incorporate multiple annotators.
  Revision: yes
- Referee: [Results] Results / Quantitative findings: The abstract states that web-scraped data are “plagued by severe” noise and that high-quality data are “heavily skewed,” yet no exact percentages, sample sizes, or tables of misidentification rates per corpus are referenced. Without these figures it is impossible to judge whether the reported problems are load-bearing or marginal.
  Authors: The body of the paper already contains per-corpus counts and qualitative breakdowns, but we accept that these are not presented in a compact, easily verifiable form. We will insert a new summary table (and accompanying text) that reports, for each corpus: total lines examined, exact misidentification rate, boilerplate/non-linguistic noise percentage, and the Western vs. Eastern variety split among the valid Lombard sentences. Sample sizes will be stated explicitly (e.g., “n = 800 lines audited from Common Crawl”). This will allow readers to assess the magnitude of the issues directly.
  Revision: yes
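The sampling procedure and summary table promised in the two responses can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the stratum key (here an arbitrary function of the line, such as its source domain), the proportional allocation, the category names `lombard`, `other_language`, and `noise`, and the variety tags `western` and `eastern` are all hypothetical.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(lines, stratum_of, total, seed=0):
    """Draw about `total` lines, allocated proportionally across strata."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    strata = defaultdict(list)
    for line in lines:
        strata[stratum_of(line)].append(line)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # At least one line per stratum; rounding may shift the total slightly.
        k = max(1, round(total * len(members) / len(lines)))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample


def summarise(labels):
    """Summarise audit labels given as (category, variety) pairs."""
    n = len(labels)
    cats = Counter(category for category, _ in labels)
    varieties = Counter(v for c, v in labels if c == "lombard" and v)
    valid = cats["lombard"]
    return {
        "n": n,
        "misidentified_pct": 100 * cats["other_language"] / n,
        "noise_pct": 100 * cats["noise"] / n,
        "western_pct_of_valid": 100 * varieties["western"] / valid if valid else 0.0,
        "eastern_pct_of_valid": 100 * varieties["eastern"] / valid if valid else 0.0,
    }


# Hypothetical audit labels, for illustration only (not results from the paper).
labels = (
    [("lombard", "western")] * 3
    + [("lombard", "eastern")]
    + [("other_language", None)] * 4
    + [("noise", None)] * 2
)
print(summarise(labels))
```

Reporting the variety split as a share of valid Lombard lines, rather than of all audited lines, keeps the bias figure independent of how noisy a given corpus is.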
Circularity Check
No circularity: purely empirical audit with no derivations or self-referential steps
The paper conducts a manual audit of corpora for the Lombard language without any mathematical derivations, equations, fitted parameters, or self-citations that form a load-bearing chain. The analysis relies on direct inspection of data sources, and the central claims rest on empirical observations rather than on closed-loop definitions or predictions. This is a standard empirical study with no circular reasoning.