Unsupervised Cross-lingual Representation Learning at Scale

2019 · cs.CL · arXiv 1911.02116

Canonical reference. 86% of the classified citations from Pith papers cite this work as background.

29 Pith papers citing it

Background: 86% of classified citations

abstract

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code, data and models publicly available.

citation-role summary

background: 6 · method: 1
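
The 86% background figure quoted on this page appears to follow from the role counts above, assuming the percentage is taken over classified citations only (6 + 1 = 7) rather than over all 29 citing papers. A minimal sketch of that arithmetic, in which the names `role_counts` and `role_share` are hypothetical and used only for illustration:

```python
# Counts copied from the citation-role summary on this page.
# Assumption: these 7 citations are the only "classified" ones.
role_counts = {"background": 6, "method": 1}


def role_share(counts, role):
    """Return the percentage of classified citations that carry `role`."""
    total = sum(counts.values())
    return 100.0 * counts[role] / total


background_pct = role_share(role_counts, "background")
total_classified = sum(role_counts.values())
print(f"background: {background_pct:.1f}% of {total_classified} classified citations")
```

Under that assumption the share is 6 of 7, which rounds to the 86% shown here.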

citation-polarity summary

background: 6 · use method: 1

representative citing papers

GAViD: A Large-Scale Multimodal Dataset for Context-Aware Group Affect Recognition from Videos

cs.CV · 2026-04-17 · unverdicted · novelty 7.0

GAViD is a new multimodal video dataset for context-aware group affect recognition, with CAGNet reaching 63.20% test accuracy, comparable to the prior state of the art.

Human-Centered Supervision for Sentiment Analysis in Telugu: A Systematic Inquiry Beyond Accuracy

cs.CL · 2025-08-02 · unverdicted · novelty 7.0

Human rationales in supervision for Telugu sentiment analysis improve model alignment with human reasoning and often produce gains in predictive performance.

When Cultures Meet: Multicultural Text-to-Image Generation

cs.CV · 2025-02-21 · unverdicted · novelty 7.0

Introduces the first benchmark for multicultural text-to-image generation across five countries and a MosAIG multi-agent framework, showing that richer prompts improve quality but disparities persist across languages and demographics.

Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings

cs.LG · 2026-05-13 · conditional · novelty 6.0

Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.

COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.

ks-pret-5m: a 5 million word, 12 million token kashmiri pretraining dataset

cs.CL · 2026-04-13 · accept · novelty 6.0

KS-PRET-5M is a newly released 5.09 million word Kashmiri pretraining dataset containing 12.13 million subword tokens after MuRIL tokenization, made available as a continuous text stream under CC BY 4.0.

Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

cs.CL · 2026-02-18 · unverdicted · novelty 6.0

CA-LIG is a unified hierarchical attribution method that computes layer-wise Integrated Gradients fused with class-specific attention gradients to generate signed, context-sensitive explanations for transformer models.

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

cs.RO · 2025-05-09 · unverdicted · novelty 6.0

UniVLA trains cross-embodiment vision-language-action policies from unlabeled videos via a latent action model in DINO space, beating OpenVLA on benchmarks with 1/20th pretraining compute and 1/10th downstream data.

How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP

cs.CL · 2024-11-08 · unverdicted · novelty 6.0

The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows filtered data matches or beats raw data in language modeling with largest gains for lower-quality editions.

DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition

cs.CL · 2024-09-17 · unverdicted · novelty 6.0

DynamicNER is a dynamic-categorization multilingual NER dataset with 155 entity types paired with CascadeNER, a two-stage lightweight LLM method claiming higher fine-grained accuracy.

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

cs.CL · 2024-06-25 · unverdicted · novelty 6.0

FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.

Scaling Data-Constrained Language Models

cs.CL · 2023-05-25 · conditional · novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

Unsupervised Dense Information Retrieval with Contrastive Learning

cs.IR · 2021-12-16 · unverdicted · novelty 6.0

Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

cs.SE · 2021-02-09 · unverdicted · novelty 6.0

CodeXGLUE supplies a standardized collection of 10 code-related tasks, 14 datasets, an evaluation platform, and BERT-, GPT-, and encoder-decoder-style baselines.

Lost in Interpretation: The Plausibility-Faithfulness Trade-off in Cross-Lingual Explanations

cs.CL · 2026-05-19 · unverdicted · novelty 5.0

English-pivot explanations for non-English LLM inputs achieve higher human span agreement but lower faithfulness, with comprehensiveness degrading up to 5.7x across tasks and languages.

Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks

cs.SE · 2026-05-14 · unverdicted · novelty 5.0

Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.

Automatic Reflection Level Classification in Hungarian Student Essays

cs.CL · 2026-05-04 · unverdicted · novelty 5.0

Classical machine learning models slightly outperform Hungarian transformers overall (71% vs 68% average score) when classifying reflection levels in student essays, though transformers handle rare classes better.

Multilingual Training and Evaluation Resources for Vision-Language Models

cs.CL · 2026-04-20 · conditional · novelty 5.0

Releases regenerated multilingual training data and translated benchmarks for VLMs in five languages and demonstrates consistent benefits from multilingual training over English-only baselines.

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance

cs.CL · 2026-04-12 · unverdicted · novelty 5.0

A new pre-training task that maps languages bidirectionally in embedding space improves machine translation by up to 11.9 BLEU, cross-lingual QA by 6.72 BERTScore points, and understanding accuracy by more than 5% relative to strong baselines.

'Layer su Layer': Identifying and Disambiguating the Italian NPN Construction in BERT's family

cs.CL · 2026-04-04 · unverdicted · novelty 5.0

Layer-wise probing shows the degree to which Italian NPN constructions' form and meaning are reflected in BERT contextual embeddings.

VerifAI: A Verifiable Open-Source Search Engine for Biomedical Question Answering

cs.IR · 2026-01-16 · unverdicted · novelty 5.0

VerifAI is an open-source biomedical QA system that decomposes generated answers into claims and verifies them with a fine-tuned NLI engine to reduce hallucinations and provide traceable citations.

Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters

cs.CL · 2025-10-16 · conditional · novelty 5.0

A 300M multilingual embedding model matches or exceeds 7B retrieval performance via optimized data scale, hard negatives, and task diversity over language diversity.

The Role of Vocabularies in Learning Sparse Representations for Ranking

cs.IR · 2025-09-20 · unverdicted · novelty 5.0

Larger 100K vocabularies in SPLADE models, especially those initialized with ESPLADE pretraining, improve retrieval effectiveness after pruning compared to 32K baselines while keeping similar efficiency.

Social media polarization during conflict: Insights from an ideological stance dataset on Israel-Palestine Reddit comments

cs.CL · 2025-02-01 · unverdicted · novelty 5.0

A new labeled dataset of 9,969 Israel-Palestine Reddit comments is created and used to compare stance classification methods, with a specific Mixtral prompt achieving the highest performance.

citing papers explorer

Showing 29 of 29 citing papers.

GAViD: A Large-Scale Multimodal Dataset for Context-Aware Group Affect Recognition from Videos cs.CV · 2026-04-17 · unverdicted · none · ref 56 · internal anchor
GAViD is a new multimodal video dataset for context-aware group affect recognition, with CAGNet reaching 63.20% test accuracy, comparable to the prior state of the art.
Human-Centered Supervision for Sentiment Analysis in Telugu: A Systematic Inquiry Beyond Accuracy cs.CL · 2025-08-02 · unverdicted · none · ref 8 · internal anchor
Human rationales in supervision for Telugu sentiment analysis improve model alignment with human reasoning and often produce gains in predictive performance.
When Cultures Meet: Multicultural Text-to-Image Generation cs.CV · 2025-02-21 · unverdicted · none · ref 1 · internal anchor
Introduces the first benchmark for multicultural text-to-image generation across five countries and a MosAIG multi-agent framework, showing that richer prompts improve quality but disparities persist across languages and demographics.
Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings cs.LG · 2026-05-13 · conditional · none · ref 2 · internal anchor
Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.
COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling cs.LG · 2026-04-22 · unverdicted · none · ref 104 · internal anchor
COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.
ks-pret-5m: a 5 million word, 12 million token kashmiri pretraining dataset cs.CL · 2026-04-13 · accept · none · ref 13 · internal anchor
KS-PRET-5M is a newly released 5.09 million word Kashmiri pretraining dataset containing 12.13 million subword tokens after MuRIL tokenization, made available as a continuous text stream under CC BY 4.0.
Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models cs.CL · 2026-02-18 · unverdicted · none · ref 59 · internal anchor
CA-LIG is a unified hierarchical attribution method that computes layer-wise Integrated Gradients fused with class-specific attention gradients to generate signed, context-sensitive explanations for transformer models.
UniVLA: Learning to Act Anywhere with Task-centric Latent Actions cs.RO · 2025-05-09 · unverdicted · none · ref 18 · internal anchor
UniVLA trains cross-embodiment vision-language-action policies from unlabeled videos via a latent action model in DINO space, beating OpenVLA on benchmarks with 1/20th pretraining compute and 1/10th downstream data.
How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP cs.CL · 2024-11-08 · unverdicted · none · ref 15 · internal anchor
The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows filtered data matches or beats raw data in language modeling with largest gains for lower-quality editions.
DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition cs.CL · 2024-09-17 · unverdicted · none · ref 10 · internal anchor
DynamicNER is a dynamic-categorization multilingual NER dataset with 155 entity types paired with CascadeNER, a two-stage lightweight LLM method claiming higher fine-grained accuracy.
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale cs.CL · 2024-06-25 · unverdicted · none · ref 28 · internal anchor
FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.
Scaling Data-Constrained Language Models cs.CL · 2023-05-25 · conditional · none · ref 25 · internal anchor
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
Unsupervised Dense Information Retrieval with Contrastive Learning cs.IR · 2021-12-16 · unverdicted · none · ref 125 · internal anchor
Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.
CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation cs.SE · 2021-02-09 · unverdicted · none · ref 13 · internal anchor
CodeXGLUE supplies a standardized collection of 10 code-related tasks, 14 datasets, an evaluation platform, and BERT-, GPT-, and encoder-decoder-style baselines.
Lost in Interpretation: The Plausibility-Faithfulness Trade-off in Cross-Lingual Explanations cs.CL · 2026-05-19 · unverdicted · none · ref 2 · internal anchor
English-pivot explanations for non-English LLM inputs achieve higher human span agreement but lower faithfulness, with comprehensiveness degrading up to 5.7x across tasks and languages.
Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks cs.SE · 2026-05-14 · unverdicted · none · ref 8 · internal anchor
Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.
Automatic Reflection Level Classification in Hungarian Student Essays cs.CL · 2026-05-04 · unverdicted · none · ref 9 · internal anchor
Classical machine learning models slightly outperform Hungarian transformers overall (71% vs 68% average score) when classifying reflection levels in student essays, though transformers handle rare classes better.
Multilingual Training and Evaluation Resources for Vision-Language Models cs.CL · 2026-04-20 · conditional · none · ref 8 · internal anchor
Releases regenerated multilingual training data and translated benchmarks for VLMs in five languages and demonstrates consistent benefits from multilingual training over English-only baselines.
Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance cs.CL · 2026-04-12 · unverdicted · none · ref 7 · internal anchor
A new pre-training task that maps languages bidirectionally in embedding space improves machine translation by up to 11.9 BLEU, cross-lingual QA by 6.72 BERTScore points, and understanding accuracy by more than 5% relative to strong baselines.
'Layer su Layer': Identifying and Disambiguating the Italian NPN Construction in BERT's family cs.CL · 2026-04-04 · unverdicted · none · ref 12 · internal anchor
Layer-wise probing shows the degree to which Italian NPN constructions' form and meaning are reflected in BERT contextual embeddings.
VerifAI: A Verifiable Open-Source Search Engine for Biomedical Question Answering cs.IR · 2026-01-16 · unverdicted · none · ref 55 · internal anchor
VerifAI is an open-source biomedical QA system that decomposes generated answers into claims and verifies them with a fine-tuned NLI engine to reduce hallucinations and provide traceable citations.
Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters cs.CL · 2025-10-16 · conditional · none · ref 2 · internal anchor
A 300M multilingual embedding model matches or exceeds 7B retrieval performance via optimized data scale, hard negatives, and task diversity over language diversity.
The Role of Vocabularies in Learning Sparse Representations for Ranking cs.IR · 2025-09-20 · unverdicted · none · ref 2 · internal anchor
Larger 100K vocabularies in SPLADE models, especially those initialized with ESPLADE pretraining, improve retrieval effectiveness after pruning compared to 32K baselines while keeping similar efficiency.
Social media polarization during conflict: Insights from an ideological stance dataset on Israel-Palestine Reddit comments cs.CL · 2025-02-01 · unverdicted · none · ref 29 · internal anchor
A new labeled dataset of 9,969 Israel-Palestine Reddit comments is created and used to compare stance classification methods, with a specific Mixtral prompt achieving the highest performance.
Attributing Culture-Conditioned Generations to Pretraining Corpora cs.CL · 2024-12-30 · unverdicted · none · ref 4 · internal anchor
MEMOed framework attributes LLM generations about cultures to pretraining memorization and finds frequency-based biases across 110 cultures for food and clothing.
Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task cs.CL · 2026-04-16 · unverdicted · none · ref 19 · internal anchor
Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance in the supervised case.
Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF cs.CL · 2026-05-05 · unverdicted · none · ref 27 · 2 links · internal anchor
The work provides a reproducible, session-based guide to the NLP pipeline with original adaptations and resources for morphologically rich low-resource languages.
Opportunities and Challenges of Large Language Models for Low-Resource Languages in Humanities Research cs.CL · 2024-11-30 · unverdicted · none · ref 23 · internal anchor
This survey paper identifies opportunities for LLMs in low-resource language humanities research along with challenges in data accessibility, model adaptability, and cultural sensitivity.
Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan cs.CL · 2026-05-09 · unreviewed · ref 18 · internal anchor
