Fast Vocabulary Transfer for Language Model Compression

Gee, Leonidas; Zugarini, Andrea; Rigutini, Leonardo; Torroni, Paolo · 2022 · DOI 10.18653/v1/2022.emnlp-industry.41

3 Pith papers cite this work.

representative citing papers

TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.

Defragmenting Language Models: An Interpretability-based Approach for Vocabulary Expansion

cs.CL · 2026-04-17 · unverdicted · novelty 7.0

Interpretability-based selection of vocabulary items plus FragMend initialization reduces token over-fragmentation and improves performance for non-Latin script languages by roughly 20 points over baselines.

Understanding Secret Leakage Risks in Code LLMs: A Tokenization Perspective

cs.CR · 2026-04-20 · unverdicted · novelty 5.0

BPE tokenization creates gibberish bias in code LLMs (CLLMs), causing secrets with high character entropy but low token entropy to be preferentially memorized due to training data distribution shifts.
