
arxiv: 2606.06349 · v1 · pith:D4I5YYS6new · submitted 2026-06-04 · 💻 cs.CL

"Chi nas dal soch el sent de legn" -- Auditing Text Corpora for Lombard

Pith reviewed 2026-06-28 01:32 UTC · model grok-4.3

classification 💻 cs.CL
keywords: Lombard language, text corpora, language identification, representational bias, under-resourced languages, web-scraped data, orthographic variation, NLP datasets

The pith

Web-scraped data for Lombard is mostly misidentified text and noise, and the genuine Lombard that remains is skewed toward Western varieties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper audits available parallel and monolingual corpora for Lombard, a language continuum spoken in Italy. It shows that the apparent abundance of web-scraped material collapses under inspection because most text is wrongly labeled as Lombard, consists of boilerplate, or contains non-linguistic content. The small amount of genuine Lombard text that remains displays inconsistent spelling systems and strong geographic imbalance, with Western varieties dominating while Eastern ones appear far less often. The authors conclude that simply collecting more scraped data will not solve the problem and that careful, variety-sensitive curation by local communities is required instead.

Core claim

Our analysis reveals that the perceived abundance of web-scraped data is an illusion, with massive datasets plagued by severe language misidentification, boilerplate text, and non-linguistic noise. Furthermore, we analyze the orthographic composition of the valid Lombard portions across web-scraped datasets, curated corpora, and benchmarks. Our findings show conflicting orthographical systems and severe representational bias across all corpora: high-quality data is heavily skewed towards Western Lombard varieties, with Eastern ones left on the margins. This underscores the need for variety-aware, community-driven data curation rather than purely quantity-driven scraping.

What carries the argument

Manual audit that checks language validity, identifies orthographic systems, and classifies regional varieties within the corpora.
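
The three checks named above (language validity, orthographic system, regional variety) can be pictured as one label record per audited line plus a tally. The following is a minimal sketch under stated assumptions, not the authors' tooling: the `AuditedLine` type, the status names, and the orthography placeholders `system_a` and `system_b` are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditedLine:
    """One manually labeled line. All field values are hypothetical labels."""
    corpus: str
    status: str                        # "lombard", "misidentified", "boilerplate", "non_linguistic"
    orthography: Optional[str] = None  # set only when status == "lombard"
    variety: Optional[str] = None      # e.g. "western" or "eastern"


def summarize(lines):
    """Tally status labels and, for valid Lombard lines only, orthography and variety."""
    valid = [l for l in lines if l.status == "lombard"]
    return {
        "n": len(lines),
        "valid_share": len(valid) / len(lines) if lines else 0.0,
        "status": dict(Counter(l.status for l in lines)),
        "orthography": dict(Counter(l.orthography for l in valid)),
        "variety": dict(Counter(l.variety for l in valid)),
    }


# Invented toy data, only to show the shape of the output.
demo = [
    AuditedLine("demo", "lombard", "system_a", "western"),
    AuditedLine("demo", "lombard", "system_b", "eastern"),
    AuditedLine("demo", "misidentified"),
    AuditedLine("demo", "boilerplate"),
]
print(summarize(demo))
```

Keeping orthography and variety as separate fields means the two kinds of skew can be reported independently, which is how the summary above describes them.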

If this is right

  • Most existing web-scraped Lombard datasets cannot be used directly for training or evaluation because the majority of their content is not Lombard.
  • Even the valid Lombard text in current resources follows conflicting spelling conventions that would interfere with model consistency.
  • High-quality Lombard data currently over-represents Western varieties and under-represents Eastern varieties.
  • Purely quantity-driven scraping will continue to reproduce the same misidentification and bias problems.
  • Community involvement in curation is necessary to achieve better variety coverage.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same misidentification and variety-bias pattern is likely to appear in scraped data for other under-resourced language continua.
  • NLP data pipelines that prioritize size over verification may systematically exclude minority dialects across many languages.
  • Future benchmarks for Lombard should include explicit checks for orthographic consistency and geographic balance.
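
The balance check suggested in the last point could look roughly like the sketch below, which flags any label whose share of a benchmark falls under a chosen floor. The function name `underrepresented` and the default `min_share` of 0.25 are assumptions made for illustration; neither comes from the paper.

```python
from collections import Counter


def underrepresented(labels, min_share=0.25):
    """Return {label: share} for every label whose share is below min_share.

    `labels` holds one variety (or orthography) label per sentence.
    The 0.25 default is an arbitrary illustrative threshold.
    """
    if not labels:
        raise ValueError("labels must not be empty")
    n = len(labels)
    counts = Counter(labels)
    return {label: c / n for label, c in counts.items() if c / n < min_share}


# Invented toy labels: nine Western sentences and one Eastern sentence.
toy = ["western"] * 9 + ["eastern"]
print(underrepresented(toy))
```

The same function applies unchanged to orthography labels, so one check can cover both spelling consistency and geographic balance.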

Load-bearing premise

The manual audit can correctly separate real Lombard text from noise and assign orthographies and regional labels without missing sources or introducing consistent bias.

What would settle it

An independent large-scale audit finding that most of the same web-scraped material is correctly identified Lombard, with balanced representation of Eastern and Western varieties, would falsify the central claims.

Original abstract

Several of the world's languages are still under-resourced in terms of Natural Language Processing (NLP) tools. This is mostly due to the lack of high-quality datasets to train, develop, and evaluate systems and models for several tasks, such as Machine Translation (MT). We conduct a manual audit of the parallel and monolingual corpora available for Lombard, an under-resourced language continuum from Italy. Our analysis reveals that the perceived abundance of web-scraped data is an illusion, with massive datasets plagued by severe language misidentification, boilerplate text, and non-linguistic noise. Furthermore, we analyze the orthographic composition of the valid Lombard portions across web-scraped datasets, curated corpora, and benchmarks. Our findings show conflicting orthographical systems and severe representational bias across all corpora: high-quality data is heavily skewed towards Western Lombard varieties, with Eastern ones left on the margins. This underscores the need for variety-aware, community-driven data curation rather than purely quantity-driven scraping.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper conducts a manual audit of parallel and monolingual corpora for Lombard (an under-resourced language continuum), finding that web-scraped datasets are dominated by language misidentification, boilerplate, and non-linguistic noise. It further reports conflicting orthographic systems in the valid Lombard portions and severe representational bias, with high-quality data skewed toward Western varieties while Eastern ones are underrepresented. The work concludes that variety-aware, community-driven curation is needed over quantity-driven scraping.

Significance. If the audit methodology proves robust, the results would usefully document concrete data-quality failures in web-scraped resources for a low-resource language and quantify orthographic and dialectal skews across corpus types. Such evidence could directly inform corpus-construction practices and benchmark design in multilingual NLP.

major comments (2)
  1. [Methods] Methods / Audit procedure: No inter-annotator agreement, annotation guidelines, sampling procedure for large scraped collections, or auditor background is reported. Because the central claims (misidentification rates, orthographic composition, and Western/Eastern bias) rest entirely on the reliability of this manual classification, the absence of validation metrics leaves the quantitative findings unsupported.
  2. [Results] Results / Quantitative findings: The abstract states that web-scraped data are “plagued by severe” noise and that high-quality data are “heavily skewed,” yet no exact percentages, sample sizes, or tables of misidentification rates per corpus are referenced. Without these figures it is impossible to judge whether the reported problems are load-bearing or marginal.
minor comments (1)
  1. [Title] The proverb quoted in the title is given only in Lombard; an English gloss would improve accessibility for the broader NLP audience.
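
Major comment 1 asks for inter-annotator agreement. One common statistic for that purpose, assuming two annotators label the same lines, is Cohen's kappa. The sketch below is a plain-Python illustration of the statistic only; the paper reports no such figure, and the function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented toy labels: "L" = valid Lombard, "N" = noise.
annotator_1 = ["L", "L", "N", "N"]
annotator_2 = ["L", "N", "N", "N"]
print(cohen_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for chance, which matters here because a corpus dominated by noise would yield high raw agreement even from careless labeling.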

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these targeted comments on methodology and results presentation. Both points identify areas where the manuscript can be strengthened with additional transparency, and we will revise accordingly.

Point-by-point responses
  1. Referee: [Methods] Methods / Audit procedure: No inter-annotator agreement, annotation guidelines, sampling procedure for large scraped collections, or auditor background is reported. Because the central claims (misidentification rates, orthographic composition, and Western/Eastern bias) rest entirely on the reliability of this manual classification, the absence of validation metrics leaves the quantitative findings unsupported.

    Authors: We agree the audit procedure requires fuller documentation. The manual classification was performed by the first author (a native speaker of Western Lombard with formal training in Romance linguistics and prior experience annotating dialectal data). We will add a new subsection (likely 2.2) that (a) reproduces the annotation guidelines used for language identification, noise detection, and variety labeling, (b) describes the sampling procedure (stratified random samples of 500–1000 lines per corpus, with explicit handling of very large scraped collections), and (c) states the auditor’s background. Because only one annotator was involved, inter-annotator agreement statistics are not applicable; we will explicitly note this limitation and outline how future work could incorporate multiple annotators.

    Revision: yes

  2. Referee: [Results] Results / Quantitative findings: The abstract states that web-scraped data are “plagued by severe” noise and that high-quality data are “heavily skewed,” yet no exact percentages, sample sizes, or tables of misidentification rates per corpus are referenced. Without these figures it is impossible to judge whether the reported problems are load-bearing or marginal.

    Authors: The body of the paper already contains per-corpus counts and qualitative breakdowns, but we accept that these are not presented in a compact, easily verifiable form. We will insert a new summary table (and accompanying text) that reports, for each corpus: total lines examined, exact misidentification rate, boilerplate/non-linguistic noise percentage, and the Western vs. Eastern variety split among the valid Lombard sentences. Sample sizes will be stated explicitly (e.g., “n = 800 lines audited from Common Crawl”). This will allow readers to assess the magnitude of the issues directly.

    Revision: yes
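
The two responses above mention fixed-size random samples from very large collections and per-corpus rates with explicit sample sizes. The sketch below illustrates one way such numbers could be produced, under stated assumptions: reservoir sampling (Algorithm R) stands in for the unspecified sampling procedure and would be applied once per corpus to approximate the stratification, and a Wilson score interval is added to show the uncertainty around a rate measured on a sample. Both function names are hypothetical, and the counts in the usage lines are invented.

```python
import math
import random


def reservoir_sample(lines, k, seed=0):
    """Draw k items uniformly at random from an iterable of unknown length (Algorithm R)."""
    rng = random.Random(seed)  # fixed seed so the audit sample is reproducible
    sample = []
    for i, line in enumerate(lines):
        if i < k:
            sample.append(line)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = line
    return sample


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion; z = 1.96 gives roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Invented example: 800 lines sampled from a stream, 600 of them judged misidentified.
audit_sample = reservoir_sample((f"line {i}" for i in range(100_000)), k=800)
low, high = wilson_interval(600, len(audit_sample))
print(len(audit_sample), round(low, 3), round(high, 3))
```

Reservoir sampling reads the collection once without loading it into memory, which suits scraped corpora too large to shuffle; reporting an interval next to each rate lets readers judge how much a sample of a few hundred lines can support.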

Circularity Check

0 steps flagged

No circularity: purely empirical audit with no derivations or self-referential steps

full rationale

The paper conducts a manual audit of corpora for the Lombard language without any mathematical derivations, equations, fitted parameters, or self-citations that form a load-bearing chain. The analysis relies on direct inspection of data sources, and the central claims rest on empirical observations rather than closed-loop definitions or predictions. This is a standard empirical study with no circular reasoning.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical audit study with no free parameters, axioms, or invented entities; all claims rest on the manual audit process and its representativeness.

pith-pipeline@v0.9.1-grok · 5704 in / 1215 out tokens · 34742 ms · 2026-06-28T01:32:45.504344+00:00 · methodology

