hub

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

Phillip Rust, Jonas Pfeiffer, Ivan Vulic, Sebastian Ruder, Iryna Gurevych , editor = · 2021 · DOI 10.18653/v1/2021.acl-long.243

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

open at publisher browse 10 citing papers

hub tools

JSON dossier citing papers JSON publisher DOI

representative citing papers

Brain-LLM Alignment Tracks Training Data, Not Typology

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Training-language dominance, not English inherent properties, determines brain-LLM alignment across English, Chinese, and French, with additional independent effects from typological distance concentrated in syntactic brain regions.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

Vocabulary adaptation via targeted token addition and replacement improves semantic similarity, domain word usage, and training efficiency for LLM summarization in legal and medical domains.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

cs.CL · 2022-11-09 · unverdicted · novelty 6.0

BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.

StarCoder: may the source be with you!

cs.CL · 2023-05-09 · accept · novelty 5.0

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

MKJ at SemEval-2026 Task 9: A Comparative Study of Generalist, Specialist, and Ensemble Strategies for Multilingual Polarization

cs.CL · 2026-04-23 · unverdicted · novelty 4.0

A language-adaptive combination of generalist, specialist, and ensemble transformer models achieves 0.796 macro F1 and 0.826 accuracy on multilingual polarization detection across 22 languages.

YEZE at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization via Heterogeneous Ensembling

cs.CL · 2026-05-07 · unverdicted · novelty 2.0 · 2 refs

A heterogeneous ensemble of XLM-RoBERTa-large and mDeBERTa-v3-base with independent task modeling and class weighting is reported as effective for multilingual, multicultural, and multievent online polarization detection.

Tokenization with Split Trees

cs.CL · 2026-05-21

Compute Optimal Tokenization

cs.CL · 2026-05-02

Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation

cs.CL · 2026-04-20

citing papers explorer

Showing 10 of 10 citing papers.

Brain-LLM Alignment Tracks Training Data, Not Typology cs.CL · 2026-05-21 · unverdicted · none · ref 22
Training-language dominance, not English inherent properties, determines brain-LLM alignment across English, Chinese, and French, with additional independent effects from typological distance concentrated in syntactic brain regions.
OPT: Open Pre-trained Transformer Language Models cs.CL · 2022-05-02 · unverdicted · none · ref 250
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization cs.CL · 2026-05-17 · unverdicted · none · ref 32
Vocabulary adaptation via targeted token addition and replacement improves semantic similarity, domain word usage, and training efficiency for LLM summarization in legal and medical domains.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model cs.CL · 2022-11-09 · unverdicted · none · ref 125
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
StarCoder: may the source be with you! cs.CL · 2023-05-09 · accept · none · ref 194
StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.
MKJ at SemEval-2026 Task 9: A Comparative Study of Generalist, Specialist, and Ensemble Strategies for Multilingual Polarization cs.CL · 2026-04-23 · unverdicted · none · ref 28
A language-adaptive combination of generalist, specialist, and ensemble transformer models achieves 0.796 macro F1 and 0.826 accuracy on multilingual polarization detection across 22 languages.
YEZE at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization via Heterogeneous Ensembling cs.CL · 2026-05-07 · unverdicted · none · ref 51 · 2 links
A heterogeneous ensemble of XLM-RoBERTa-large and mDeBERTa-v3-base with independent task modeling and class weighting is reported as effective for multilingual, multicultural, and multievent online polarization detection.
Tokenization with Split Trees cs.CL · 2026-05-21 · unreviewed · ref 66
Compute Optimal Tokenization cs.CL · 2026-05-02 · unreviewed · ref 28
Syntax as a Rosetta Stone: Universal Dependencies for In-Context Coptic Translation cs.CL · 2026-04-20 · unreviewed · ref 24

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer