hub

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models , booktitle =

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models , author= · 2021 · DOI 10.18653/v1/2021.acl-long.243

20 Pith papers cite this work. Polarity classification is still indexing.

20 Pith papers citing it

open at publisher browse 20 citing papers

hub tools

JSON dossier citing papers JSON publisher DOI

representative citing papers

ALEE: Any-Language Evaluation of Embeddings via English-Centric Minimal Pairs

cs.CL · 2026-06-30 · unverdicted · novelty 7.0

ALEE generates AMR-based English minimal pairs with fine-grained semantic shifts, translates them, and evaluates embedding models on 275+ languages to expose cross-lingual gaps linked to training data and tokenization.

MinGram: A Minimalist Unigram Tokenizer with High Compression and Competitive Morphological Alignment

cs.CL · 2026-06-25 · unverdicted · novelty 7.0

MinGram is a simplified Unigram tokenizer training method that prioritizes token count minimization to deliver higher compression than BPE and standard Unigram while retaining competitive morphological alignment and superior bits-per-byte performance in language model training.

LangMAP: A Language-Adaptive Approach to Tokenization

cs.CL · 2026-06-22 · unverdicted · novelty 7.0

LangMAP adapts UnigramLM for multilingual use to deliver language-specific tokenization from a shared vocabulary, boosting boundary alignment metrics across natural and programming languages with mixed downstream fine-tuning gains.

Sycophancy as a Multilingual Alignment Failure: How Safety Degrades Across Languages, Topics, and Models

cs.CL · 2026-06-07 · unverdicted · novelty 7.0

Evaluation across 1.1 million instances shows sycophancy rates spike in low-resource languages, remain topic-agnostic, and correlate with tokenizer fertility.

UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

UA-Legal-Bench is a new five-task benchmark for Ukrainian legal reasoning that demonstrates task-dependent few-shot prompting effects and the need for macro-F1 over accuracy on imbalanced classes.

Brain-LLM Alignment Tracks Training Data, Not Typology

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Training-language dominance, not English inherent properties, determines brain-LLM alignment across English, Chinese, and French, with additional independent effects from typological distance concentrated in syntactic brain regions.

Tokenization with Split Trees

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

ToaST uses vocabulary-independent split trees and integer programming to produce tokenizers with over 11% fewer tokens than BPE, WordPiece, and UnigramLM while improving 1.5B-parameter LM scores on CORE.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

DEPART: DEcomposing PARiTy across Multilingual LLMs

cs.CL · 2026-05-27 · unverdicted · novelty 6.0

A Bayesian framework decomposes mLLM variance, showing language features explain 79-92% of language identity variance and that model identity vs. benchmark-model interactions dominate differently for understanding versus reasoning tasks.

Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

Vocabulary adaptation via targeted token addition and replacement improves semantic similarity, domain word usage, and training efficiency for LLM summarization in legal and medical domains.

Compute Optimal Tokenization

cs.CL · 2026-05-02 · unverdicted · novelty 6.0

In compute-optimal regimes, language model parameter count scales proportionally with data bytes rather than tokens, and the optimal compression rate decreases with increasing compute.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

cs.CL · 2022-11-09 · unverdicted · novelty 6.0

BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.

Toten: A Knowledge-Based System For Structure-Preserving Representation Of Physical Quantities And Technical Notation In Brazilian Portuguese

cs.AI · 2026-06-17 · unverdicted · novelty 5.0

TOTEN is a knowledge-based system for structure-preserving representation of physical quantities and technical notation in Brazilian Portuguese using an ontology of engineering entities and external authorities, outperforming statistical baselines in atomicity and reconstruction.

Beyond Perplexity: UTF-8 Validity in Byte-aware Language Models

cs.CL · 2026-06-12 · unverdicted · novelty 5.0

A 355M-parameter byte-level LM on 80B multilingual tokens exhibits UTF-8 validity converging after 4.2B tokens versus 2.1B for perplexity, with higher validity on rare characters than common ones.

A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models

cs.MM · 2026-05-29 · unverdicted · novelty 5.0

Pilot evaluation of language-specific versus multilingual LoRA adapters on Qwen2.5-VL-3B for curator-guided BLV art descriptions in three languages.

StarCoder: may the source be with you!

cs.CL · 2023-05-09 · accept · novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

Modular Monolingual Adaptation using Pretrained Language Models

cs.CL · 2026-06-04 · unverdicted · novelty 4.0

Replacing tokens, freezing the corresponding embeddings, and tuning the rest of the model improves NLU performance on low-resource languages compared to full fine-tuning.

MKJ at SemEval-2026 Task 9: A Comparative Study of Generalist, Specialist, and Ensemble Strategies for Multilingual Polarization

cs.CL · 2026-04-23 · unverdicted · novelty 4.0

A language-adaptive combination of generalist, specialist, and ensemble transformer models achieves 0.796 macro F1 and 0.826 accuracy on multilingual polarization detection across 22 languages.

YEZE at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization via Heterogeneous Ensembling

cs.CL · 2026-05-07 · unverdicted · novelty 2.0 · 2 refs

A heterogeneous ensemble of XLM-RoBERTa-large and mDeBERTa-v3-base with independent task modeling and class weighting is reported as effective for multilingual, multicultural, and multievent online polarization detection.

Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation

cs.CL · 2026-04-20

citing papers explorer

Showing 1 of 1 citing paper after filters.

Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation cs.CL · 2026-04-20 · unreviewed · ref 24

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models , booktitle =

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer