CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

Dan Garrette; Iulia Turc; John Wieting; Jonathan H. Clark

arxiv: 2103.06874 · v4 · pith:CQFUYTTCnew · submitted 2021-03-11 · 💻 cs.CL · cs.LG



keywords: canine, model, tokenization, directly, encoder, explicit, input, neural


Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.
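
The two ideas in the abstract, reading characters directly and shortening the sequence before the deep stack, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `to_codepoints` and `downsample`, the rate of 4, and the use of mean pooling are all assumptions made for illustration, since the abstract does not specify how downsampling is performed.

```python
from typing import List


def to_codepoints(text: str) -> List[int]:
    """Map each character to its Unicode codepoint; no vocabulary lookup is involved."""
    return [ord(ch) for ch in text]


def downsample(values: List[float], rate: int = 4) -> List[float]:
    """Shorten a sequence by averaging non-overlapping windows of `rate` items.

    Mean pooling is a stand-in chosen for this sketch only; the abstract
    says the model downsamples but does not say which operation it uses.
    """
    if rate < 1:
        raise ValueError("rate must be >= 1")
    pooled = []
    for start in range(0, len(values), rate):
        window = values[start:start + rate]  # the last window may be shorter
        pooled.append(sum(window) / len(window))
    return pooled


# Escapes keep the accented characters precomposed (one codepoint each).
text = "na\u00efve caf\u00e9"
codepoints = to_codepoints(text)
shortened = downsample([float(c) for c in codepoints], rate=4)
print(len(codepoints), len(shortened))
```

Any string, in any script, maps to a codepoint sequence without an out-of-vocabulary case, and the pooled sequence is roughly `rate` times shorter, which is what would make a deep stack over character-level input affordable.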


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MimeLens: Position-Agnostic Content-Type Detection for Binary Fragments
cs.CR 2026-06 unverdicted novelty 7.0

MimeLens uses position-agnostic BERT encoders pretrained on random-offset binary windows to output one of 125 libmagic MIME labels, beating Magika on full files and enabling accurate classification on mid-file fragments.

The Tokenizer Tax Across 25 European Languages: Domain Invariance, Cross-Lingual Few-Shot Effects, and the Ukrainian Penalty
cs.CL 2026-05 unverdicted novelty 6.0

Tokenizer fertility varies 2.5x across 25 European languages with domain-invariant rankings, morphological fragmentation in high-fertility cases, and a Ukrainian penalty from pre-training underrepresentation.