MultiHashFormer enables hash-based autoregression in LMs by encoding tokens as multi-hash signatures, outperforming standard Transformers at 100M-3B scales while keeping parameter count constant for multilingual expansion.
Trans-Tokenization and Cross-lingual Vocabulary Transfers: Language Adaptation of LLMs for Low-Resource
2 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CL 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
LangMAP adapts UnigramLM for multilingual use to deliver language-specific tokenization from a shared vocabulary, boosting boundary alignment metrics across natural and programming languages with mixed downstream fine-tuning gains.
citing papers explorer
-
MultiHashFormer: Hash-based Generative Language Models
MultiHashFormer enables hash-based autoregression in LMs by encoding tokens as multi-hash signatures, outperforming standard Transformers at 100M-3B scales while keeping parameter count constant for multilingual expansion.
-
LangMAP: A Language-Adaptive Approach to Tokenization
LangMAP adapts UnigramLM for multilingual use to deliver language-specific tokenization from a shared vocabulary, boosting boundary alignment metrics across natural and programming languages with mixed downstream fine-tuning gains.