Neural Machine Translation of Rare Words with Subword Units
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1). Years: 2026 (1). Verdicts: UNVERDICTED (1).
Representative citing papers
Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study
Tokenizer fertility varies 1.6x across models on Ukrainian legal text, and Qwen uses 60% more tokens than Llama-family models. Zero-shot prompting outperforms few-shot by up to 26 points, and pre-war classifiers lose 27.9 points on invasion-era decisions.
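
The summary does not say how fertility is measured. A common definition is the average number of tokens a tokenizer produces per whitespace-delimited word, and the sketch below assumes that definition. The function names and the two toy tokenizers are hypothetical stand-ins for illustration, not the tokenizers or code used in the cited study.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word (assumed definition)."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def whitespace_tokenize(text: str) -> List[str]:
    """Hypothetical baseline: one token per word, so fertility is 1.0."""
    return text.split()


def char_bigram_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer: 2-character pieces per word."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return pieces


# Toy Ukrainian sample ("the court issued a decision"), not data from the study.
sample = ["суд ухвалив рішення"]

baseline = fertility(sample, whitespace_tokenize)
subword = fertility(sample, char_bigram_tokenize)
print(f"baseline fertility: {baseline:.2f}")
print(f"subword fertility:  {subword:.2f}")
print(f"ratio:              {subword / baseline:.2f}x")
```

To compare real models, one would pass each model's own tokenizer as the `tokenize` callable and run the same corpus through all of them. The ratio between two fertility values is then the kind of figure the summary reports as a multiple.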