In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
6 Pith papers cite this work.
fields: cs.CL (6)
citing papers explorer
- Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation
  New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
- FLEXITOKENS: Flexible Tokenization for Evolving Language Models
  FLEXITOKENS replaces rigid subword tokenizers and fixed-compression auxiliary losses with a simplified boundary-prediction objective in byte-level models, yielding lower over-fragmentation and up to 10-point gains on multilingual and domain-adaptation tasks.
- ATD-Trans: A Geographically Grounded Japanese-English Travelogue Translation Dataset
  ATD-Trans is a new geographically annotated Japanese-English travelogue dataset that reveals Japanese-enhanced models perform better on geo-entity translation while domestic Japanese locations remain harder to translate accurately.
- How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP
  The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows filtered data matches or beats raw data in language modeling with largest gains for lower-quality editions.
- DynamicNER: A Dynamic, Multilingual, and Fine-Grained Dataset for LLM-based Named Entity Recognition
  DynamicNER is a dynamic-categorization multilingual NER dataset with 155 entity types paired with CascadeNER, a two-stage lightweight LLM method claiming higher fine-grained accuracy.
- Cross-Lingual Transfer for Distantly Supervised and Low-resources Indonesian NER
  Cross-lingual fine-tuning of pre-trained LMs yields significant gains on small gold Indonesian NER and competitive results on large silver data versus monolingual LM or POS transfer.