pith:ERBMSKCV
Pretraining Language Models with Subword Regularization: An Empirical Study of BPE Dropout in Low-Resource NLP
Stochastic tokenization during both pretraining and fine-tuning yields the best results in low-resource NLP tasks.
arxiv:2605.13436 v1 · 2026-05-13 · cs.CL · cs.LG
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{ERBMSKCVHF5ETVX2DJW5NHOQD3}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Across tasks, the best results are typically obtained when stochastic tokenization is applied during both pretraining and fine-tuning, whereas applying BPE dropout only during fine-tuning can underperform deterministic tokenization in smaller-data settings.
These claims rest on two assumptions: that the downsampled subsets of high-resource languages and the chosen evaluation tasks sufficiently represent truly low-resource scenarios, and that the modest morphological alignment gains explain the performance benefits.
Stochastic tokenization with BPE dropout during both pretraining and fine-tuning outperforms deterministic tokenization or fine-tuning-only dropout on low-resource NLP tasks.
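The claims above contrast deterministic BPE with BPE dropout. As a rough illustration of the difference, here is a toy sketch in which each candidate merge is skipped with probability `dropout` at every merge step. It is not the paper's implementation: the function name `bpe_encode`, the merge table, and the example word are all hypothetical, chosen only for illustration.

```python
import random


def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Segment `word` with a toy BPE; skip each candidate merge with probability `dropout`.

    `merges` is an ordered list of symbol pairs (earlier = higher priority).
    This is an illustrative sketch, not the tokenizer used in the paper.
    """
    rng = rng or random.Random()
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        # Collect adjacent pairs that are known merges and survive dropout.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks and (dropout == 0.0 or rng.random() >= dropout)
        ]
        if not candidates:
            break  # nothing left to merge (or every candidate was dropped)
        _, i = min(candidates)  # highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Hypothetical merge table, for illustration only.
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

if __name__ == "__main__":
    print(bpe_encode("lower", MERGES))               # deterministic segmentation
    print(bpe_encode("lower", MERGES, dropout=0.1))  # segmentation may vary per call
```

With `dropout=0.0` the segmentation is always the same; with a nonzero value the same word can be split differently on each call, which is what exposes a model to multiple segmentations during training.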
Receipt and verification
| First computed | 2026-05-18T02:44:47.101101Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
2442c92855397a49d6fa1a6dd69dd01ed208fffb1703e0dd7cb1e5486fb175bb
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/ERBMSKCVHF5ETVX2DJW5NHOQD3 \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 2442c92855397a49d6fa1a6dd69dd01ed208fffb1703e0dd7cb1e5486fb175bb
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "95cf7dcc087fa1067e2a6e12016390653396469c8ef71be08c5541b0a847e0e6",
    "cross_cats_sorted": [
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-05-13T12:31:04Z",
    "title_canon_sha256": "fe3fc4f0fe3e5638c42b357229d5dddc318e3b413ba0ca2cab87bff14dadd35a"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13436",
    "kind": "arxiv",
    "version": 1
  }
}
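The same check can be sketched offline in Python, mirroring the serialization used in the shell one-liner above (sorted keys, compact separators, UTF-8 with `ensure_ascii=False`). This assumes the record printed on this page is exactly what the endpoint returns as `canonical_record`; the names `RECORD`, `EXPECTED`, and `canonical_sha256` are illustrative, not part of any Pith API.

```python
import hashlib
import json

# Canonical record copied from this page (assumed identical to the
# endpoint's `canonical_record` field).
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "95cf7dcc087fa1067e2a6e12016390653396469c8ef71be08c5541b0a847e0e6",
        "cross_cats_sorted": ["cs.LG"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-05-13T12:31:04Z",
        "title_canon_sha256": "fe3fc4f0fe3e5638c42b357229d5dddc318e3b413ba0ca2cab87bff14dadd35a",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13436", "kind": "arxiv", "version": 1},
}

# Canonical hash as listed on this page.
EXPECTED = "2442c92855397a49d6fa1a6dd69dd01ed208fffb1703e0dd7cb1e5486fb175bb"


def canonical_sha256(record):
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    digest = canonical_sha256(RECORD)
    print(digest)
    print("match" if digest == EXPECTED else "mismatch")
```

Because keys are sorted before hashing, the digest does not depend on the key order or indentation in which the record is displayed.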