An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT)
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CL (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
citing papers explorer
- CHALIS: A Challenge Dataset for Language Identification in Difficult Scenarios
  Introduces the CHALIS benchmark dataset, which tests language identification on mutually intelligible cousin language pairs and orthographically noisy inputs; the evaluation shows that existing systems struggle substantially.
- CAT-Translate: Building Compact Open-Source Models for Japanese-English Translation
  Compact 0.8B-7B models for bidirectional Japanese-English translation outperform large multilingual models on real-world domain benchmarks.