Unsupervised Cross-lingual Representation Learning at Scale
Canonical reference. 86% of citing Pith papers cite this work as background.
abstract
This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6% average accuracy on XNLI, +13% average F1 score on MLQA, and +2.4% F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7% in XNLI accuracy for Swahili and 11.4% for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code, data and models publicly available.
citing papers explorer
- GAViD: A Large-Scale Multimodal Dataset for Context-Aware Group Affect Recognition from Videos
GAViD is a new multimodal video dataset for context-aware group affect recognition; its CAGNet baseline reaches 63.20% test accuracy, comparable to the prior state of the art.
- Human-Centered Supervision for Sentiment Analysis in Telugu: A Systematic Inquiry Beyond Accuracy
Human rationales in supervision for Telugu sentiment analysis improve model alignment with human reasoning and often produce gains in predictive performance.
- When Cultures Meet: Multicultural Text-to-Image Generation
Introduces the first benchmark for multicultural text-to-image generation across five countries and a MosAIG multi-agent framework, showing that richer prompts improve quality but disparities persist across languages and demographics.
- Mix, Don't Tune: Bilingual Pre-Training Outperforms Hyperparameter Search in Data-Constrained Settings
Mixing auxiliary high-resource language data outperforms hyperparameter tuning in data-constrained bilingual pre-training, with gains equivalent to 2-13 times more unique target data.
- COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling
COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.
- KS-PRET-5M: A 5 Million Word, 12 Million Token Kashmiri Pretraining Dataset
KS-PRET-5M is a newly released 5.09 million word Kashmiri pretraining dataset containing 12.13 million subword tokens after MuRIL tokenization, made available as a continuous text stream under CC BY 4.0.
- Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models
CA-LIG is a unified hierarchical attribution method that computes layer-wise Integrated Gradients fused with class-specific attention gradients to generate signed, context-sensitive explanations for transformer models.
- UniVLA: Learning to Act Anywhere with Task-centric Latent Actions
UniVLA trains cross-embodiment vision-language-action policies from unlabeled videos via a latent action model in DINO space, beating OpenVLA on benchmarks with 1/20th pretraining compute and 1/10th downstream data.
- How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP
The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows filtered data matches or beats raw data in language modeling with largest gains for lower-quality editions.
- DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition
DynamicNER is a dynamic-categorization multilingual NER dataset with 155 entity types paired with CascadeNER, a two-stage lightweight LLM method claiming higher fine-grained accuracy.
- The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.
- Scaling Data-Constrained Language Models
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
- Unsupervised Dense Information Retrieval with Contrastive Learning
Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.
- CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
CodeXGLUE supplies a standardized collection of 10 code-related tasks, 14 datasets, an evaluation platform, and BERT-, GPT-, and encoder-decoder-style baselines.
- Lost in Interpretation: The Plausibility-Faithfulness Trade-off in Cross-Lingual Explanations
English-pivot explanations for non-English LLM inputs achieve higher human span agreement but lower faithfulness, with comprehensiveness degrading up to 5.7x across tasks and languages.
- Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks
Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.
- Automatic Reflection Level Classification in Hungarian Student Essays
Classical machine learning models outperform Hungarian transformers slightly in overall performance (71% vs 68% average score) for classifying reflection levels in student essays, though transformers handle rare classes better.
- Multilingual Training and Evaluation Resources for Vision-Language Models
Releases regenerated multilingual training data and translated benchmarks for VLMs in five languages and demonstrates consistent benefits from multilingual training over English-only baselines.
- Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance
A new pre-training task that maps languages bidirectionally in embedding space improves machine translation by up to 11.9 BLEU, cross-lingual QA by 6.72 BERTScore points, and understanding accuracy by over 5% over strong baselines.
- 'Layer su Layer': Identifying and Disambiguating the Italian NPN Construction in BERT's family
Layer-wise probing shows the degree to which Italian NPN constructions' form and meaning are reflected in BERT contextual embeddings.
- VerifAI: A Verifiable Open-Source Search Engine for Biomedical Question Answering
VerifAI is an open-source biomedical QA system that decomposes generated answers into claims and verifies them with a fine-tuned NLI engine to reduce hallucinations and provide traceable citations.
- Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters
A 300M multilingual embedding model matches or exceeds 7B retrieval performance via optimized data scale, hard negatives, and task diversity over language diversity.
- The Role of Vocabularies in Learning Sparse Representations for Ranking
Larger 100K vocabularies in SPLADE models, especially those initialized with ESPLADE pretraining, improve retrieval effectiveness after pruning compared to 32K baselines while keeping similar efficiency.
- Social media polarization during conflict: Insights from an ideological stance dataset on Israel-Palestine Reddit comments
A new labeled dataset of 9,969 Israel-Palestine Reddit comments is created and used to compare stance classification methods, with a specific Mixtral prompt achieving the highest performance.
- Attributing Culture-Conditioned Generations to Pretraining Corpora
MEMOed framework attributes LLM generations about cultures to pretraining memorization and finds frequency-based biases across 110 cultures for food and clothing.
- Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task
Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance in the supervised case.
- Natural Language Processing: A Comprehensive Practical Guide from Tokenisation to RLHF
The work provides a reproducible, session-based guide to the NLP pipeline with original adaptations and resources for morphologically rich low-resource languages.
- Opportunities and Challenges of Large Language Models for Low-Resource Languages in Humanities Research
This survey paper identifies opportunities for LLMs in low-resource language humanities research along with challenges in data accessibility, model adaptability, and cultural sensitivity.
- Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan