Pith Number

pith:ERBMSKCV

pith:2026:ERBMSKCVHF5ETVX2DJW5NHOQD3
not attested · not anchored · not stored · refs resolved

Pretraining Language Models with Subword Regularization: An Empirical Study of BPE Dropout in Low-Resource NLP

Marcel Dunaiski, Ruan Visser, Trienko Grobler

Stochastic tokenization during both pretraining and fine-tuning yields the best results in low-resource NLP tasks.

arxiv:2605.13436 v1 · 2026-05-13 · cs.CL · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{ERBMSKCVHF5ETVX2DJW5NHOQD3}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Across tasks, the best results are typically obtained when stochastic tokenization is applied during both pretraining and fine-tuning, whereas applying BPE dropout only during fine-tuning can underperform deterministic tokenization in smaller-data settings.

C2 · weakest assumption

The study assumes that the downsampled subsets of high-resource languages and the chosen evaluation tasks sufficiently represent truly low-resource scenarios, and that the modest morphological alignment gains explain the performance benefits.

C3 · one-line summary

Stochastic tokenization with BPE dropout during both pretraining and fine-tuning outperforms deterministic tokenization or fine-tuning-only dropout on low-resource NLP tasks.
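
To make "stochastic tokenization" concrete, the following is a minimal sketch of BPE dropout as it is commonly described: at each merge step, every applicable merge is skipped with probability `dropout`, so the same word can be segmented differently on each call. The merge table `MERGES` and the function name `bpe_encode` are hypothetical and chosen for illustration; this is not the authors' implementation, and it applies one merge per step for simplicity.

```python
import random

# Hypothetical toy merge table, ordered by priority (earlier = learned first).
MERGES = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]


def bpe_encode(word, merges, dropout=0.0, rng=None):
    """Segment `word` with BPE, skipping each candidate merge with probability `dropout`."""
    rng = rng or random.Random()
    rank = {pair: i for i, pair in enumerate(merges)}
    tokens = list(word)
    while len(tokens) > 1:
        # Adjacent pairs that are in the merge table and survive dropout this step.
        candidates = [
            (rank[(a, b)], i)
            for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
            if (a, b) in rank and rng.random() >= dropout
        ]
        if not candidates:
            break  # no merge survived, so segmentation stops here
        _, i = min(candidates)  # highest-priority surviving merge, leftmost on ties
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


# dropout=0.0 is deterministic BPE; dropout=1.0 falls back to characters.
deterministic = bpe_encode("lower", MERGES, dropout=0.0)
characters = bpe_encode("lower", MERGES, dropout=1.0)
sampled = bpe_encode("lower", MERGES, dropout=0.1, rng=random.Random(0))
```

In the setting the claims describe, a sampler like this would be called while encoding both the pretraining corpus and the fine-tuning data, rather than only during fine-tuning.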

References

34 extracted · 34 resolved · 1 Pith anchor

Receipt and verification
First computed 2026-05-18T02:44:47.101101Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

2442c92855397a49d6fa1a6dd69dd01ed208fffb1703e0dd7cb1e5486fb175bb

Aliases

arxiv: 2605.13436 · arxiv_version: 2605.13436v1 · doi: 10.48550/arxiv.2605.13436 · pith_short_12: ERBMSKCVHF5E · pith_short_16: ERBMSKCVHF5ETVX2 · pith_short_8: ERBMSKCV
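
For this record, the values on the page are consistent with a simple relationship: each `pith_short_*` alias is a prefix of the full identifier, and the identifier equals the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. The sketch below checks that observation. It is inferred from the values shown here, not from a documented schema rule, and the helper names `base32_prefix` and `short_alias` are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "2442c92855397a49d6fa1a6dd69dd01ed208fffb1703e0dd7cb1e5486fb175bb"
PITH_NUMBER = "ERBMSKCVHF5ETVX2DJW5NHOQD3"


def base32_prefix(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_alias(pith_number: str, length: int) -> str:
    """Assumed derivation of the short aliases: a plain prefix of the identifier."""
    return pith_number[:length]


derived = base32_prefix(CANONICAL_HASH)
aliases = {n: short_alias(PITH_NUMBER, n) for n in (8, 12, 16)}
```

If the relationship holds in general, a client could cross-check an identifier against a recomputed hash without contacting the server; treat that as an assumption until the schema confirms it.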
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/ERBMSKCVHF5ETVX2DJW5NHOQD3 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 2442c92855397a49d6fa1a6dd69dd01ed208fffb1703e0dd7cb1e5486fb175bb
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "95cf7dcc087fa1067e2a6e12016390653396469c8ef71be08c5541b0a847e0e6",
    "cross_cats_sorted": [
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-05-13T12:31:04Z",
    "title_canon_sha256": "fe3fc4f0fe3e5638c42b357229d5dddc318e3b413ba0ca2cab87bff14dadd35a"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13436",
    "kind": "arxiv",
    "version": 1
  }
}
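
The shell one-liner under "Verify this Pith Number yourself" serializes the canonical record with sorted keys and compact separators, then hashes the UTF-8 bytes with SHA-256. The same steps can be sketched offline in Python using the record printed above. The function name `canonical_sha256` is hypothetical, and the sketch assumes the JSON shown on this page is exactly what the API returns under `canonical_record`.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Canonical record copied from this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "95cf7dcc087fa1067e2a6e12016390653396469c8ef71be08c5541b0a847e0e6",
        "cross_cats_sorted": ["cs.LG"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-05-13T12:31:04Z",
        "title_canon_sha256": "fe3fc4f0fe3e5638c42b357229d5dddc318e3b413ba0ca2cab87bff14dadd35a",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13436", "kind": "arxiv", "version": 1},
}

# Hash the page states for this record.
EXPECTED = "2442c92855397a49d6fa1a6dd69dd01ed208fffb1703e0dd7cb1e5486fb175bb"

digest = canonical_sha256(RECORD)
print(digest, "match" if digest == EXPECTED else "mismatch")
```

Because `sort_keys=True` fixes the key order, the digest does not depend on the order in which the keys arrive, which is what lets an independent mirror recompute the same value.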