FLEXITOKENS replaces rigid subword tokenizers and fixed-compression auxiliary losses with a simplified boundary-prediction objective in byte-level models, yielding lower over-fragmentation and up to 10-point gains on multilingual and domain-adaptation tasks.
Alabi, Yanke Mao, Haonan Gao, and Annie En-Shiun Lee
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CL (3)
verdicts: UNVERDICTED (3)
representative citing papers
MOSAIC achieves mean macro F1 of 88 on chest X-ray report classification across five datasets in four languages using a 4B-parameter open model with low GPU memory and few-shot or light fine-tuning options.
Four MAFT-based PLMs for Angolan languages report 12.3-point gains over AfroXLMR-base and 3.8-point gains over OFA baselines on downstream tasks.
citing papers explorer
- FLEXITOKENS: Flexible Tokenization for Evolving Language Models
- MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification
- ANGOFA: Leveraging OFA Embedding Initialization and Synthetic Data for Angolan Language Model