MIDI is a new multilingual idiom dataset with sentence and conversational contexts; benchmarking reveals worse performance in low-resource languages and on literal vs. figurative uses.
Chen Liu, Fajri Koto, Timothy Baldwin, and Iryna Gurevych
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 4roles
method 1polarities
use method 1representative citing papers
MultiSynt/MT supplies 4.8 trillion translated tokens in 36 languages from 100B English tokens, letting LLMs match native-data baselines with 72% fewer tokens and beat them by 15% at equal budget.
Empirical study across 10 tasks showing bias inheritance from LLM-augmented data harms related downstream performance, with three misalignment factors and three mitigation strategies identified.
Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.
citing papers explorer
-
Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages
MIDI is a new multilingual idiom dataset with sentence and conversational contexts; benchmarking reveals worse performance in low-resource languages and on literal vs. figurative uses.
-
MultiSynt/MT: Trillion-Token Multi-Parallel Pre-Training Data Translated Across 36 Languages
MultiSynt/MT supplies 4.8 trillion translated tokens in 36 languages from 100B English tokens, letting LLMs match native-data baselines with 72% fewer tokens and beat them by 15% at equal budget.
-
Understanding and Mitigating Bias Inheritance in LLM-based Data Augmentation on Downstream Tasks
Empirical study across 10 tasks showing bias inheritance from LLM-augmented data harms related downstream performance, with three misalignment factors and three mitigation strategies identified.
-
Anthropogenic Regional Adaptation in Multimodal Vision-Language Model
Anthropogenic Regional Adaptation with GG-EZ improves cultural relevance in multimodal vision-language models for Southeast Asia by 5-15% while retaining over 98% of global performance.