hub Canonical reference

Proceedings of the Association for Computational Linguistics (ACL) , pages =

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán + 2 more · 2020 · Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics · DOI 10.18653/v1/2020.acl-main.747

Canonical reference. 86% of citing Pith papers cite this work as background.

42 Pith papers citing it

2,255 external citations · Crossref

Background 86% of classified citations

open at publisher browse 42 citing papers

hub tools

JSON dossier citing papers JSON publisher DOI

citation-role summary

background 6 method 1

citation-polarity summary

background 6 use method 1

representative citing papers

SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents

cs.CL · 2025-12-08 · accept · novelty 8.0

SwissGov-RSD is the first naturalistic cross-lingual document-level benchmark with human token-level semantic difference annotations, on which both LLMs and encoders show a large performance gap relative to simpler settings.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

cs.CL · 2023-04-03 · accept · novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

Brain-LLM Alignment Tracks Training Data, Not Typology

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Training-language dominance, not English inherent properties, determines brain-LLM alignment across English, Chinese, and French, with additional independent effects from typological distance concentrated in syntactic brain regions.

TIDAL: Recovering Temporal Phase for Cloud Block Storage Placement from LLM-Derived Semantics

cs.OS · 2026-05-18 · unverdicted · novelty 7.0

TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.

100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts

cs.CL · 2026-05-09 · accept · novelty 7.0 · 2 refs

A new corpus of 100,502 annotated movie reviews from Kazakhstan enables sentiment analysis research in Russian, Kazakh, and code-switched texts.

Harnessing Linguistic Dissimilarity for Language Generalization on Unseen Low-Resource Varieties

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

A framework with TOPPing source selection and VACAI-Bowl dual-branch model yields 54.62% average improvement in dependency parsing across 10 low-resource varieties.

Nsanku: Evaluating Zero-Shot Translation Performance of LLMs for Ghanaian Languages

cs.CL · 2026-05-05 · accept · novelty 7.0

Nsanku benchmark shows current LLMs achieve only modest zero-shot translation scores on 43 Ghanaian languages, with no model reaching both high average performance and high cross-language consistency.

Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL

cs.CL · 2026-04-22 · unverdicted · novelty 7.0

Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.

Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

Translation function vectors extracted from English to one target language improve correct token ranking for translations to multiple other unseen target languages in decoder-only multilingual LLMs.

C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment

cs.CL · 2026-04-17 · unverdicted · novelty 7.0

C-Mining automatically mines high-fidelity Culture Points from raw multilingual text by treating cross-lingual geometric isolation in embeddings as a quantifiable signal for cultural specificity, then uses them to synthesize better instruction data.

WhatsApp Vaccine Discourse (WhaVax): An Expert-Annotated Dataset and Benchmark for Health Misinformation Detection

cs.SI · 2026-03-25 · unverdicted · novelty 7.0

WhaVax is a new expert-annotated dataset of WhatsApp vaccine messages with benchmarks showing competitive performance from embeddings and LLMs for misinformation detection under data scarcity.

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

cs.CL · 2024-02-05 · unverdicted · novelty 7.0

M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.

Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions

cs.CL · 2026-05-22 · unverdicted · novelty 6.0

LINK improves cross-lingual knowledge transfer via lexical substitutions in English pretraining data, yielding notable downstream gains and up to 2x training speedup across eight languages and five model sizes.

Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax

cs.CL · 2026-05-14 · unverdicted · novelty 6.0

Reinforcement learning with semantic rewards lets LLMs gain low-resource language skills without the alignment tax that degrades general capabilities in supervised fine-tuning.

ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset

cs.CL · 2026-05-13 · conditional · novelty 6.0

ATD-Trans is a new geographically annotated Japanese-English travelogue dataset that reveals Japanese-enhanced models perform better on geo-entity translation while domestic Japanese locations remain harder to translate accurately.

Enhancing Multilingual Counterfactual Generation through Alignment-as-Preference Optimization

cs.CL · 2026-05-12 · unverdicted · novelty 6.0

Macro uses Direct Preference Optimization on composite-scored preference pairs to improve validity of multilingual self-generated counterfactual explanations by 12.55% on average without degrading minimality.

Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

Dynamic Meta-Metrics learns source-sentence-conditioned combinations of MT metrics, with MLP-based hard and soft clustering versions outperforming static linear and Gaussian process ensembles on WMT data.

Improving Lexical Difficulty Prediction with Context-Aligned Contrastive Learning and Ridge Ensembling

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

Context-Aligned Contrastive Regression combines cross-view context alignment and ordinal soft contrastive learning with ridge ensembles to improve lexical difficulty prediction across L1 backgrounds on three datasets.

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.

JFinTEB: Japanese Financial Text Embedding Benchmark

cs.IR · 2026-04-17 · unverdicted · novelty 6.0

JFinTEB is the first benchmark for evaluating Japanese financial text embeddings across retrieval and classification tasks derived from realistic financial scenarios.

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

cs.CL · 2024-06-25 · unverdicted · novelty 6.0

FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.

StarCoder 2 and The Stack v2: The Next Generation

cs.SE · 2024-02-29 · accept · novelty 6.0

StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

cs.CL · 2022-11-09 · unverdicted · novelty 6.0

BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

cs.CL · 2021-08-27 · unverdicted · novelty 6.0

ALiBi enables transformers trained on length-1024 sequences to extrapolate to length-2048 with the same perplexity as a sinusoidal model trained on 2048, while training 11% faster and using 11% less memory.

citing papers explorer

Showing 42 of 42 citing papers.

SwissGov-RSD: A Human-annotated, Cross-lingual Benchmark for Token-level Recognition of Semantic Differences Between Related Documents cs.CL · 2025-12-08 · accept · none · ref 12
SwissGov-RSD is the first naturalistic cross-lingual document-level benchmark with human token-level semantic difference annotations, on which both LLMs and encoders show a large performance gap relative to simpler settings.
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling cs.CL · 2023-04-03 · accept · none · ref 101
Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.
Brain-LLM Alignment Tracks Training Data, Not Typology cs.CL · 2026-05-21 · unverdicted · none · ref 9
Training-language dominance, not English inherent properties, determines brain-LLM alignment across English, Chinese, and French, with additional independent effects from typological distance concentrated in syntactic brain regions.
TIDAL: Recovering Temporal Phase for Cloud Block Storage Placement from LLM-Derived Semantics cs.OS · 2026-05-18 · unverdicted · none · ref 18
TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.
100,000+ Movie Reviews from Kazakhstan: Russian, Kazakh, and Code-Switched Texts cs.CL · 2026-05-09 · accept · none · ref 16 · 2 links
A new corpus of 100,502 annotated movie reviews from Kazakhstan enables sentiment analysis research in Russian, Kazakh, and code-switched texts.
Harnessing Linguistic Dissimilarity for Language Generalization on Unseen Low-Resource Varieties cs.CL · 2026-05-06 · unverdicted · none · ref 34
A framework with TOPPing source selection and VACAI-Bowl dual-branch model yields 54.62% average improvement in dependency parsing across 10 low-resource varieties.
Nsanku: Evaluating Zero-Shot Translation Performance of LLMs for Ghanaian Languages cs.CL · 2026-05-05 · accept · none · ref 24
Nsanku benchmark shows current LLMs achieve only modest zero-shot translation scores on 43 Ghanaian languages, with no model reaching both high average performance and high cross-language consistency.
Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL cs.CL · 2026-04-22 · unverdicted · none · ref 24
Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.
Exploring Language-Agnosticity in Function Vectors: A Case Study in Machine Translation cs.CL · 2026-04-21 · unverdicted · none · ref 65
Translation function vectors extracted from English to one target language improve correct token ranking for translations to multiple other unseen target languages in decoder-only multilingual LLMs.
C-Mining: Unsupervised Discovery of Seeds for Cultural Data Synthesis via Geometric Misalignment cs.CL · 2026-04-17 · unverdicted · none · ref 2
C-Mining automatically mines high-fidelity Culture Points from raw multilingual text by treating cross-lingual geometric isolation in embeddings as a quantifiable signal for cultural specificity, then uses them to synthesize better instruction data.
WhatsApp Vaccine Discourse (WhaVax): An Expert-Annotated Dataset and Benchmark for Health Misinformation Detection cs.SI · 2026-03-25 · unverdicted · none · ref 18
WhaVax is a new expert-annotated dataset of WhatsApp vaccine messages with benchmarks showing competitive performance from embeddings and LLMs for misinformation detection under data scarcity.
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation cs.CL · 2024-02-05 · unverdicted · none · ref 92
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions cs.CL · 2026-05-22 · unverdicted · none · ref 9
LINK improves cross-lingual knowledge transfer via lexical substitutions in English pretraining data, yielding notable downstream gains and up to 2x training speedup across eight languages and five model sizes.
Reinforcement Learning with Semantic Rewards Enables Low-Resource Language Expansion without Alignment Tax cs.CL · 2026-05-14 · unverdicted · none · ref 37
Reinforcement learning with semantic rewards lets LLMs gain low-resource language skills without the alignment tax that degrades general capabilities in supervised fine-tuning.
ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset cs.CL · 2026-05-13 · conditional · none · ref 49
ATD-Trans is a new geographically annotated Japanese-English travelogue dataset that reveals Japanese-enhanced models perform better on geo-entity translation while domestic Japanese locations remain harder to translate accurately.
Enhancing Multilingual Counterfactual Generation through Alignment-as-Preference Optimization cs.CL · 2026-05-12 · unverdicted · none · ref 10
Macro uses Direct Preference Optimization on composite-scored preference pairs to improve validity of multilingual self-generated counterfactual explanations by 12.55% on average without degrading minimality.
Dynamic Meta-Metrics: Source-Sentence Conditioned Weighting for MT Evaluation cs.CL · 2026-05-09 · unverdicted · none · ref 25
Dynamic Meta-Metrics learns source-sentence-conditioned combinations of MT metrics, with MLP-based hard and soft clustering versions outperforming static linear and Gaussian process ensembles on WMT data.
Improving Lexical Difficulty Prediction with Context-Aligned Contrastive Learning and Ridge Ensembling cs.CL · 2026-05-09 · unverdicted · none · ref 29
Context-Aligned Contrastive Regression combines cross-view context alignment and ordinal soft contrastive learning with ridge ensembles to improve lexical difficulty prediction across L1 backgrounds on three datasets.
Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus cs.CL · 2026-05-01 · unverdicted · none · ref 20
Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.
JFinTEB: Japanese Financial Text Embedding Benchmark cs.IR · 2026-04-17 · unverdicted · none · ref 2
JFinTEB is the first benchmark for evaluating Japanese financial text embeddings across retrieval and classification tasks derived from realistic financial scenarios.
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale cs.CL · 2024-06-25 · unverdicted · none · ref 61
FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.
StarCoder 2 and The Stack v2: The Next Generation cs.SE · 2024-02-29 · accept · none · ref 184
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model cs.CL · 2022-11-09 · unverdicted · none · ref 222
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation cs.CL · 2021-08-27 · unverdicted · none · ref 4
ALiBi enables transformers trained on length-1024 sequences to extrapolate to length-2048 with the same perplexity as a sinusoidal model trained on 2048, while training 11% faster and using 11% less memory.
Sakura at BEA 2026 Shared Task 1: What Makes Vocabulary Difficult? cs.CL · 2026-05-14 · accept · none · ref 5 · 2 links
Fine-tuned LLM and explainable models predict vocabulary difficulty with correlations r > 0.91 and r > 0.77, showing spelling difficulty and test item construction as key influences in addition to word production difficulty.
The Impact of Vocabulary Overlaps on Knowledge Transfer in Multilingual Machine Translation cs.CL · 2026-05-05 · unverdicted · none · ref 42
Experiments show domain match and language relatedness drive knowledge transfer in multilingual MT more than vocabulary overlap.
Dependency Parsing Across the Resource Spectrum: Evaluating Architectures on High and Low-Resource Languages cs.CL · 2026-05-04 · unverdicted · none · ref 56
Biaffine LSTM outperforms transformer parsers like AfroXLMR and RemBERT in low-resource dependency parsing, with transformers gaining advantage as data increases and morphological complexity as a secondary predictor.
Knowledge-driven Augmentation and Retrieval for Integrative Temporal Adaptation cs.CL · 2026-04-23 · unverdicted · none · ref 17
KARITA integrates knowledge-driven augmentation and retrieval to improve classification performance under temporal shifts across clinical, legal, and scientific domains.
Detecting Hallucinations in SpeechLLMs at Inference Time Using Attention Maps cs.CL · 2026-04-21 · unverdicted · none · ref 6
Four attention metrics enable logistic regression classifiers that detect hallucinations in SpeechLLMs with up to +0.23 PR-AUC gains over baselines on ASR and translation tasks.
Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression cs.AI · 2026-04-21 · unverdicted · none · ref 148
LightEdit enables scalable lifelong knowledge editing in LLMs via selective knowledge retrieval and probability suppression during decoding, outperforming prior methods on ZSRE, Counterfact, and RIPE while reducing training costs.
Language, Place, and Social Media: Geographic Dialect Alignment in New Zealand cs.CL · 2026-04-17 · unverdicted · none · ref 56
New Zealand Reddit users link language to place and form contiguous speech communities with complex geographic alignment; Word2Vec embeddings reveal semantic variations and shifts in NZ English on a 4.26 billion word corpus.
MSMO-ABSA: Multi-Scale and Multi-Objective Optimization for Cross-Lingual Aspect-Based Sentiment Analysis cs.CL · 2025-02-19 · unverdicted · none · ref 10
MSMO framework achieves claimed SOTA cross-lingual ABSA via sentence- and aspect-level alignment, code-switching, consistency training, and knowledge distillation.
Exploring Cross-lingual Latent Transplantation: Mutual Opportunities and Open Challenges cs.CL · 2024-12-17 · unverdicted · none · ref 13
XTransplant empirically shows that cross-lingual latent transplantation yields mutual benefits for multilingual capability and cultural adaptability in LLMs, especially low-resource ones, while revealing underutilized model potential.
Multilingual E5 Text Embeddings: A Technical Report cs.CL · 2024-02-08 · unverdicted · none · ref 3
Open-source multilingual E5 embedding models are trained via contrastive pre-training on 1 billion text pairs and fine-tuning, with an instruction-tuned model matching English SOTA performance.
Findings of the Fifth Shared Task on Multilingual Coreference Resolution: Expanding Datasets for Long-Range Entities cs.CL · 2026-05-20 · accept · none · ref 6
The 2026 multilingual coreference shared task expanded to 27 datasets in 19 languages with a focus on long-range entities; traditional systems led but LLMs showed notable potential.
ANGOFA: Leveraging OFA Embedding Initialization and Synthetic Data for Angolan Language Model cs.CL · 2024-04-03 · unverdicted · none · ref 6
Four MAFT-based PLMs for Angolan languages report 12.3-point gains over AfroXLMR-base and 3.8-point gains over OFA baselines on downstream tasks.
ICT-NLP at SemEval-2026 Task 3: Less Is More -- Multilingual Encoder with Joint Training and Adaptive Ensemble for Dimensional Aspect Sentiment Regression cs.CL · 2026-05-11 · unverdicted · none · ref 4
A lightweight multilingual encoder system with joint training and adaptive ensemble achieves top-half rankings across datasets in SemEval-2026 dimensional aspect sentiment regression.
CLaC at SemEval-2026 Task 6: Response Clarity Detection in Political Discourse cs.CL · 2026-05-04 · unverdicted · none · ref 18
An LLM ensemble reached 80 macro-F1 on 3-class clarity detection and 59 on 9-class evasion detection, with partial layer unfreezing and multilingual ensembles improving encoder results while enriched context helped only LLMs.
Scripts Through Time: A Survey of the Evolving Role of Transliteration in NLP cs.CL · 2026-04-20 · unverdicted · none · ref 37
A survey that taxonomizes motivations for transliteration in cross-lingual NLP, reviews incorporation approaches and their evolution, analyzes trade-offs in settings like code-mixing and language families, and offers implementation recommendations.
Multilingual Vision-Language Models, A Survey cs.CL · 2025-09-26 · accept · none · ref 31
The survey identifies a key tension in multilingual vision-language models between language neutrality via contrastive learning and cultural awareness via diverse data, with most benchmarks relying on translation-based evaluation.
cantnlp@DravidianLangTech 2026: organic domain adaptation improves multi-class hope speech detection in Tulu cs.CL · 2026-05-10 · unverdicted · none · ref 25
Organic domain adaptation of XLM-RoBERTa on Tulu social media text improves multi-class hope speech detection in code-mixed Tulu on the development set.
YEZE at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization via Heterogeneous Ensembling cs.CL · 2026-05-07 · unverdicted · none · ref 33 · 2 links
A heterogeneous ensemble of XLM-RoBERTa-large and mDeBERTa-v3-base with independent task modeling and class weighting is reported as effective for multilingual, multicultural, and multievent online polarization detection.

Proceedings of the Association for Computational Linguistics (ACL) , pages =

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer