"Chi nas dal soch el sent de legn" -- Auditing Text Corpora for Lombard

Edoardo Signoroni; Pavel Rychlý

arxiv: 2606.06349 · v1 · pith:D4I5YYS6new · submitted 2026-06-04 · 💻 cs.CL

Pith reviewed 2026-06-28 01:32 UTC · model grok-4.3

classification 💻 cs.CL

keywords: Lombard language, text corpora, language identification, representational bias, under-resourced languages, web-scraped data, orthographic variation, NLP datasets

The pith

Web-scraped data for Lombard is mostly misidentified noise and skewed toward Western varieties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits available parallel and monolingual corpora for Lombard, a language continuum spoken in Italy. It shows that the apparent abundance of web-scraped material collapses under inspection because most text is wrongly labeled as Lombard, consists of boilerplate, or contains non-linguistic content. The small amount of genuine Lombard text that remains displays inconsistent spelling systems and strong geographic imbalance, with Western varieties dominating while Eastern ones appear far less often. The authors conclude that simply collecting more scraped data will not solve the problem and that careful, variety-sensitive curation by local communities is required instead.

Core claim

Our analysis reveals that the perceived abundance of web-scraped data is an illusion, with massive datasets plagued by severe language misidentification, boilerplate text, and non-linguistic noise. Furthermore, we analyze the orthographic composition of the valid Lombard portions across web-scraped datasets, curated corpora, and benchmarks. Our findings show conflicting orthographical systems and severe representational bias across all corpora: high-quality data is heavily skewed towards Western Lombard varieties, with Eastern ones left on the margins. This underscores the need for variety-aware, community-driven data curation rather than purely quantity-driven scraping.

What carries the argument

Manual audit that checks language validity, identifies orthographic systems, and classifies regional varieties within the corpora.

If this is right

Most existing web-scraped Lombard datasets cannot be used directly for training or evaluation because the majority of their content is not Lombard.
Even the valid Lombard text in current resources follows conflicting spelling conventions that would interfere with model consistency.
High-quality Lombard data currently over-represents Western varieties and under-represents Eastern varieties.
Purely quantity-driven scraping will continue to reproduce the same misidentification and bias problems.
Community involvement in curation is necessary to achieve better variety coverage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same misidentification and variety-bias pattern is likely to appear in scraped data for other under-resourced language continua.
NLP data pipelines that prioritize size over verification may systematically exclude minority dialects across many languages.
Future benchmarks for Lombard should include explicit checks for orthographic consistency and geographic balance.

Load-bearing premise

The manual audit can correctly separate real Lombard text from noise and assign orthographies and regional labels without missing sources or introducing consistent bias.

What would settle it

An independent large-scale audit that finds the majority of the same web-scraped material is correctly identified Lombard with balanced representation of Eastern and Western varieties would falsify the central claims.

Original abstract

Several of the world's languages are still under-resourced in terms of Natural Language Processing (NLP) tools. This is mostly due to the lack of high-quality datasets to train, develop, and evaluate systems and models for several tasks, such as Machine Translation (MT). We conduct a manual audit of the parallel and monolingual corpora available for Lombard, an under-resourced language continuum from Italy. Our analysis reveals that the perceived abundance of web-scraped data is an illusion, with massive datasets plagued by severe language misidentification, boilerplate text, and non-linguistic noise. Furthermore, we analyze the orthographic composition of the valid Lombard portions across web-scraped datasets, curated corpora, and benchmarks. Our findings show conflicting orthographical systems and severe representational bias across all corpora: high-quality data is heavily skewed towards Western Lombard varieties, with Eastern ones left on the margins. This underscores the need for variety-aware, community-driven data curation rather than purely quantity-driven scraping.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

Lombard audit flags mostly noisy scraped data and Western variety skew, but the manual labeling lacks any reported validation.

The main point is that web-scraped corpora for Lombard turn out to be mostly misidentified text, boilerplate, or noise, while the cleaner portions heavily favor Western varieties over Eastern ones.

The paper runs a manual audit across parallel and monolingual sources, then breaks down orthographic systems and regional coverage in the valid Lombard text. The variety bias finding is the clearest new piece; prior work on low-resource data quality has not zeroed in on Lombard this way.

The audit approach itself is reasonable for the goal. Showing that quantity-driven scraping produces little usable material for this language continuum is a useful reminder for anyone building tools in similar settings.

The soft spot is exactly where the stress-test note flags it. The central numbers on misidentification rates and bias depend on the manual judgments, yet the abstract gives no sampling method, no inter-annotator agreement, no guidelines, and no auditor background. Without those, the reported percentages could shift with different labelers. If the full paper adds those checks, the claims get stronger; right now they rest on unverified manual work.

This is for people doing data curation or low-resource NLP who need concrete examples of why scraped sets fail. A reader working on minority Romance varieties or similar audit projects would get direct value.

I would send it to peer review. The problem is real and the direction is honest, even if the audit details need tightening.

Referee Report

2 major / 1 minor

Summary. The paper conducts a manual audit of parallel and monolingual corpora for Lombard (an under-resourced language continuum), finding that web-scraped datasets are dominated by language misidentification, boilerplate, and non-linguistic noise. It further reports conflicting orthographic systems in the valid Lombard portions and severe representational bias, with high-quality data skewed toward Western varieties while Eastern ones are underrepresented. The work concludes that variety-aware, community-driven curation is needed over quantity-driven scraping.

Significance. If the audit methodology proves robust, the results would usefully document concrete data-quality failures in web-scraped resources for a low-resource language and quantify orthographic and dialectal skews across corpus types. Such evidence could directly inform corpus-construction practices and benchmark design in multilingual NLP.

major comments (2)

[Methods] Methods / Audit procedure: No inter-annotator agreement, annotation guidelines, sampling procedure for large scraped collections, or auditor background is reported. Because the central claims (misidentification rates, orthographic composition, and Western/Eastern bias) rest entirely on the reliability of this manual classification, the absence of validation metrics leaves the quantitative findings unsupported.
[Results] Results / Quantitative findings: The abstract states that web-scraped data are “plagued by severe” noise and that high-quality data are “heavily skewed,” yet no exact percentages, sample sizes, or tables of misidentification rates per corpus are referenced. Without these figures it is impossible to judge whether the reported problems are load-bearing or marginal.
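
The agreement check requested in the first major comment can be illustrated with Cohen's kappa, a standard chance-corrected agreement statistic for two annotators. The sketch below is only an illustration: the label names (`lombard`, `noise`, `other_lang`) and the ten example judgments are hypothetical and are not taken from the paper.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Share of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical judgments on ten audited lines (not data from the paper).
annotator_1 = ["lombard", "lombard", "noise", "other_lang", "noise",
               "lombard", "other_lang", "noise", "lombard", "noise"]
annotator_2 = ["lombard", "noise", "noise", "other_lang", "noise",
               "lombard", "lombard", "noise", "lombard", "noise"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Reporting such a statistic on even a small doubly annotated subset would let readers judge how far the misidentification and variety percentages depend on a single labeler.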

minor comments (1)

[Title] The title opens with a Lombard phrase that is left untranslated; an English gloss would improve accessibility for the broader NLP audience.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these targeted comments on methodology and results presentation. Both points identify areas where the manuscript can be strengthened with additional transparency, and we will revise accordingly.

Referee: [Methods] Methods / Audit procedure: No inter-annotator agreement, annotation guidelines, sampling procedure for large scraped collections, or auditor background is reported. Because the central claims (misidentification rates, orthographic composition, and Western/Eastern bias) rest entirely on the reliability of this manual classification, the absence of validation metrics leaves the quantitative findings unsupported.

Authors: We agree the audit procedure requires fuller documentation. The manual classification was performed by the first author (a native speaker of Western Lombard with formal training in Romance linguistics and prior experience annotating dialectal data). We will add a new subsection (likely 2.2) that (a) reproduces the annotation guidelines used for language identification, noise detection, and variety labeling, (b) describes the sampling procedure (stratified random samples of 500–1000 lines per corpus, with explicit handling of very large scraped collections), and (c) states the auditor’s background. Because only one annotator was involved, inter-annotator agreement statistics are not applicable; we will explicitly note this limitation and outline how future work could incorporate multiple annotators.

Revision: yes

Referee: [Results] Results / Quantitative findings: The abstract states that web-scraped data are “plagued by severe” noise and that high-quality data are “heavily skewed,” yet no exact percentages, sample sizes, or tables of misidentification rates per corpus are referenced. Without these figures it is impossible to judge whether the reported problems are load-bearing or marginal.

Authors: The body of the paper already contains per-corpus counts and qualitative breakdowns, but we accept that these are not presented in a compact, easily verifiable form. We will insert a new summary table (and accompanying text) that reports, for each corpus: total lines examined, exact misidentification rate, boilerplate/non-linguistic noise percentage, and the Western vs. Eastern variety split among the valid Lombard sentences. Sample sizes will be stated explicitly (e.g., “n = 800 lines audited from Common Crawl”). This will allow readers to assess the magnitude of the issues directly.

Revision: yes
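
The two promises in the responses above, a stratified random sample per corpus and a per-corpus summary table, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the rebuttal does not say what the strata are, so stratifying by source domain is an assumption, and the function names, label categories, toy corpus, and counts are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(lines, stratum_of, total, seed=0):
    """Draw about `total` lines, allocated proportionally to stratum size."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    strata = defaultdict(list)
    for line in lines:
        strata[stratum_of(line)].append(line)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Proportional allocation; rounding may shift the total slightly.
        k = max(1, round(total * len(members) / len(lines)))
        sample.extend(rng.sample(members, min(k, len(members))))
    return sample

def summarise(labels):
    """labels: (category, variety) pairs; variety is None unless 'lombard'."""
    n = len(labels)
    cats = Counter(cat for cat, _ in labels)
    varieties = Counter(var for cat, var in labels if cat == "lombard")
    valid = cats["lombard"]
    return {
        "n": n,
        "misidentified_pct": 100 * cats["other_lang"] / n,
        "noise_pct": 100 * cats["noise"] / n,
        "lombard_pct": 100 * valid / n,
        "western_share_pct": 100 * varieties["western"] / valid if valid else 0.0,
        "eastern_share_pct": 100 * varieties["eastern"] / valid if valid else 0.0,
    }

# Toy corpus of (source_domain, text) pairs; the domains are placeholders.
corpus = (
    [("a.example", f"line {i}") for i in range(600)]
    + [("b.example", f"line {i}") for i in range(300)]
    + [("c.example", f"line {i}") for i in range(100)]
)
sample = stratified_sample(corpus, stratum_of=lambda line: line[0], total=100)

# Hypothetical manual labels for ten sampled lines (not results from the paper).
labels = (
    [("other_lang", None)] * 5
    + [("noise", None)] * 2
    + [("lombard", "western")] * 2
    + [("lombard", "eastern")]
)
table = summarise(labels)
print(len(sample), table)
```

One row of this kind per corpus, together with the stated sample size, would give readers the compact and verifiable form that the referee asks for.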

Circularity Check

0 steps flagged

No circularity: purely empirical audit with no derivations or self-referential steps

The paper conducts a manual audit of corpora for the Lombard language and contains no mathematical derivations, equations, fitted parameters, or self-citations that could form a load-bearing chain. Its central claims rest on direct inspection of the data sources, that is, on empirical observation rather than on closed-loop definitions or predictions. It is a standard empirical study with no circular reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical audit study with no free parameters, axioms, or invented entities; all claims rest on the manual audit process and its representativeness.

pith-pipeline@v0.9.1-grok · 5704 in / 1215 out tokens · 34742 ms · 2026-06-28T01:32:45.504344+00:00 · methodology


[1] [1]

2017 , url =

Paganessi, Giulia , title =. 2017 , url =

2017

[2] [2]

Automatic language identification in texts: a survey , year =

Jauhiainen, Tommi and Lui, Marco and Zampieri, Marcos and Baldwin, Timothy and Lind\'. Automatic language identification in texts: a survey , year =. J. Artif. Int. Res. , month = may, pages =. doi:10.1613/jair.1.11675 , abstract =

work page doi:10.1613/jair.1.11675

[3] [3]

Multilingua , doi =

The new speakers of Lombard , author =. Multilingua , doi =

[4] [4]

Chambers, J. K. and Trudgill, Peter , year=. Dialectology , DOI=

[5] [5]

Los conceptos de dialecto, nivel y estilo de lengua y el sentido propio de la dialectolog

Eugenio Coseriu , year=. Los conceptos de dialecto, nivel y estilo de lengua y el sentido propio de la dialectolog

[6] [6]

2010 , url=

Assessing Endangerment: Expanding Fishmans's GIDS , author=. 2010 , url=

2010

[7] [7]

Scriver Lombard

Lissander Brasca. Scriver Lombard. 2011

2011

[8] [8]

Journal of Multilingual and Multicultural Development , volume =

Paolo Coluzzi and Lissander Brasca and Emanuele Miola , title =. Journal of Multilingual and Multicultural Development , volume =. 2019 , publisher =

2019

[9] [9]

Endangered Minority and Regional Languages ('dialects') in Italy , volume =

Coluzzi, Paolo , year =. Endangered Minority and Regional Languages ('dialects') in Italy , volume =. Modern Italy , doi =

[10] [10]

Findings of the V ar D ial Evaluation Campaign 2025: The N or SID Shared Task on N orwegian Slot, Intent and Dialect Identification

Scherrer, Yves and van der Goot, Rob and M hlum, Petter. Findings of the V ar D ial Evaluation Campaign 2025: The N or SID Shared Task on N orwegian Slot, Intent and Dialect Identification. Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects. 2025

2025

[11] [11]

Findings of the V ar D ial Evaluation Campaign 2023

Aepli, No. Findings of the V ar D ial Evaluation Campaign 2023. Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023). 2023. doi:10.18653/v1/2023.vardial-1.25

work page doi:10.18653/v1/2023.vardial-1.25 2023

[12] [12]

Findings of the V ar D ial Evaluation Campaign 2022

Aepli, No. Findings of the V ar D ial Evaluation Campaign 2022. Proceedings of the Ninth Workshop on NLP for Similar Languages, Varieties and Dialects. 2022

2022

[13] [13]

2022 , eprint=

No Language Left Behind: Scaling Human-Centered Machine Translation , author=. 2022 , eprint=

2022

[14] [15]

2024 , howpublished =

Mistral-AI , title =. 2024 , howpublished =

2024

[15] [16]

2017 , howpublished =

ISTAT , title =. 2017 , howpublished =

2017

[16] [17]

2026 , howpublished =

ISTAT , title =. 2026 , howpublished =

2026

[17] [18]

EuroLLM: Multilingual Language Models for Europe. 2024

[18] [19]

Qwen2.5 Technical Report. 2025

[19] [20]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. 2025

[20] [21]

Phi-4 Technical Report. 2024

[21] [22]

Maiden, Martin and Parry, Mair

[22] [23]

van der Jeught, Stefaan. Journal on Ethnopolitics and Minority Issues in Europe

[23] [24]

Moseley, Christopher

[24] [25]

Quantifying the Carbon Emissions of Machine Learning. arXiv preprint arXiv:1910.09700

[25] [26]

Delmonte, Rodolfo and Bristot, Antonella and Tonelli, Sara and Pianta, Emanuele. Proceedings of ISMTCL. 2009

[26] [27]

Fronteddu, Gianfranco and Al. Una eina per a una llengua en proc. Linguam. 2017

[27] [28]

Wdowiak, Eryk. Intelligent Computing (SAI 2022). 2022

[28] [29]

Rule-based machine translation for the Italian–Sardinian language pair. The Prague Bulletin of Mathematical Linguistics. 2017

[29] [30]

Jörg Tiedemann. News from OPUS - A Collection of Multilingual Parallel Corpora with Tools and Interfaces. Recent Advances in Natural Language Processing. 2009

[30] [31]

Transcending Language Boundaries: Harnessing LLMs for Low-Resource Language Translation. 2024

[31] [32]

Seth Aycock and David Stap and Di Wu and Christof Monz and Khalil Sima'an. Can. 2025

[32] [33]

A Benchmark for Learning to Translate a New Language from One Grammar Book. 2024

[33] [34]

Pugh, Robert and Tyers, Francis. Experiments in Multi-Variant Natural Language Processing for Nahuatl. Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024). 2024. doi:10.18653/v1/2024.vardial-1.12

[34] [35]

Simons, Andreas and De Pascale, Stefano and Franco, Karlien. Highly Granular Dialect Normalization and Phonological Dialect Translation for Limburgish. Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024). 2024. doi:10.18653/v1/2024.vardial-1.13

[35] [36]

Hopton, Zachary and Aepli, No. Modeling Orthographic Variation in Occitan's Dialects. Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024). 2024. doi:10.18653/v1/2024.vardial-1.6

[36] [37]

Cavnar, William and Trenkle, John. N-Gram-Based Text Categorization

[37] [38]

Vicente, Aileen Joan and Cheng, Charibeth. Language Identification of Philippine Creole Spanish: Discriminating Chavacano From Related Languages. Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024). 2024. doi:10.18653/v1/2024.vardial-1.16

[38] [39]

Gillin, Nat. One-Shot Prompt for Language Variety Identification. Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024). 2024. doi:10.18653/v1/2024.vardial-1.20

[39] [40]

Chifu, Adrian-Gabriel and Glavaš, Goran and Ionescu, Radu Tudor and Ljubešić, Nikola and Miletić, Aleksandra and Miletić, Filip and Scherrer, Yves and Vulić, Ivan. VarDial Evaluation Campaign 2024: Commonsense Reasoning in Dialects and Multi-Label Similar Language Identification. Proceedings of the Eleventh Workshop on NLP for Simil...

doi:10.18653/v1/2024.vardial-1.1. 2024

[40] [41]

Bednaříková, Emma and Rychlý, Pavel. Recent Advances in Slavonic Natural Language Processing (RASLAN 2025). 2025

[41] [42]

Lison, Pierre and Tiedemann, J. OpenSubtitles2016: Extracting Large Parallel Corpora from Movie and TV Subtitles. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). 2016

[42] [43]

Bird, Steven and Klein, Ewan and Loper, Edward

[43] [44]

mmbert: A modern multilingual encoder with annealed language learning. arXiv preprint arXiv:2509.06888

[44] [45]

Warner, Benjamin and Chaffin, Antoine and Clavi. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.127. URL https://aclanthology.org/2025.acl-long.127/

[45] [46]

Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML. 2001

[46] [47]

Osmelak, Doreen and Wintner, Shuly. The Denglisch Corpus of German-English Code-Switching. Proceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP. 2023. doi:10.18653/v1/2023.sigtyp-1.5

[47] [48]

Caroline Sabty and Islam Mesabah and Özlem Çetinoğlu and Slim Abdennadher. Language Identification of Intra-Word Code-Switching for Arabic–English. 2021. doi:10.1016/j.array.2021.100104

[48] [49]

A Hindi-English Code-Switching Corpus. LREC

[49] [50]

Kargaran, Amir Hossein and Imani, Ayyoob and Yvon, François and Schuetze, Hinrich. GlotLID: Language Identification for Low-Resource Languages. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.410

[50] [51]

FastText.zip: Compressing text classification models. 2017

[51] [52]

Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas. Bag of Tricks for Efficient Text Classification. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. 2017

[52] [53]

Anesa, Marino and Rondi, Mario. 1981

[53] [54]

Bilingual Speech: A Typology of Code-mixing. 2000

[54] [55]

Berruto, Gaetano. The Dialects of. 1997. doi:10.4324/9780203993880-46

[55] [56]

Berruto, Gaetano. Romania et Slavia adriatica. Festschrift für Zarko Muljačić. 1987

[56] [57]

Maiden, Martin and Parry, Mair

[57] [58]

Wardhaugh, Ronald

[58] [59]

Automatic language identification in texts: A survey. Journal of Artificial Intelligence Research

[59] [60]

Lavecchia, Caroline and Sma. The 4th International Workshop on Natural Language Processing and Cognitive Science - NLPCS 2007. Jun 2007

[60] [61]

John J. Gumperz. RELC Journal. 1977

[61] [62]

Andreoli, Giulia

[62] [63]

Massimo Cerruti. Code-switching in Italo-Romance: a variationist study of convergence in bilingual speech. Lingue e linguaggio, Rivista semestrale. 2018. doi:10.1418/90425

[63] [64]

Frighetto, Federica

[64] [65]

Dal Negro, Silvia and Ciccolone, Simone. Il parlato bilingue: Italiano e tedesco a contatto in un corpus sudtirolese

[65] [66]

Fiorentini, Ilaria

[66] [67]

Beyond English-Centric Multilingual Machine Translation. 2020

[67] [68]

CCMatrix: Mining Billions of High-Quality Parallel Sentences on the WEB. 2020

[68] [69]

Schwenk, Holger and Chaudhary, Vishrav and Sun, Shuo and Gong, Hongyu and Guzmán, Francisco. WikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021. doi:10.18653/v1/2021.eacl-main.115

[69] [70]

El-Kishky, Ahmed and Renduchintala, Adithya and Cross, James and Guzmán, Francisco and Koehn, Philipp. XLEnt: Mining a Large Cross-lingual Entity Dataset with Lexical-Semantic-Phonetic Word Alignment. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.814

[70] [71]

Tiedemann, J. Parallel Data, Tools and Interfaces in OPUS. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12). 2012

[71] [72]

Piötòst Ché Niènt, Mèi Piötòst - A Manually Revised Lombard-Italian Parallel Corpus. Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2022. 2022

[72] [73]

Pedro Javier. Asynchronous pipelines for processing huge corpora on medium to low resource infrastructures. 2019. doi:10.14618/ids-pub-9021

[73] [74]

Nguyen, Thuat and others. CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages. Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). 2024

[74] [75]

HPLT 3.0: Very Large-Scale Multilingual Resources for LLM and MT. Mono- and Bi-lingual Data, Multilingual Evaluation, and Pre-Trained Models. 2025

[75] [76]

Imani, Ayyoob and others. Edited by Rogers, Anna and Boyd-Graber, Jordan and Okazaki, Naoaki. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Toronto, Canada: Association for Computational Linguistics. July 2023. doi:10.18653/v1/2023.acl-long.61

[76] [77]

NLLB Team. Scaling neural machine translation to 200 languages. Nature. 2024. doi:10.1038/s41586-024-07335-x

[77] [78]

Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786

[78] [79]

Jones, Alexander and Caswell, Isaac and Firat, Orhan and Saxena, Ishank. GATITOS: Using a New Multilingual Lexicon for Low-resource Machine Translation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.26

[79] [80]

Small Data, Big Impact: Leveraging Minimal Data for Effective Machine Translation. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

[80] [81]

Larkin, Vladimir