Pith Number

pith:ERBMSKCV

pith:2026:ERBMSKCVHF5ETVX2DJW5NHOQD3

Status: not attested · not anchored · not stored · refs resolved

Pretraining Language Models with Subword Regularization: An Empirical Study of BPE Dropout in Low-Resource NLP

Marcel Dunaiski, Ruan Visser, Trienko Grobler

Stochastic tokenization during both pretraining and fine-tuning yields the best results in low-resource NLP tasks.

arxiv:2605.13436 v1 · 2026-05-13 · cs.CL · cs.LG

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{ERBMSKCVHF5ETVX2DJW5NHOQD3}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
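The page does not describe the merge algorithm itself. As a purely illustrative sketch of what "deterministic" means here, the Python below folds a set of events into a state that does not depend on the order or duplication of the input. The event fields (`kind`, `who`, `at`), the `author_claim` event type, and the function names are all hypothetical, and signature checking is left out entirely.

```python
import hashlib
import json


def event_id(event):
    """Content hash of an event (hypothetical stand-in for a real event ID)."""
    blob = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()


def merge_state(record, events):
    """Fold events into a state that is independent of arrival order."""
    # Deduplicate by content hash so a mirror that saw an event twice agrees.
    unique = {event_id(e): e for e in events}
    # Total order: timestamp first, content hash as the tie-breaker.
    ordered = sorted(unique.items(), key=lambda kv: (kv[1]["at"], kv[0]))
    state = {"record": record, "claims": []}
    for _, event in ordered:
        if event["kind"] == "author_claim":  # hypothetical event type
            state["claims"].append(event["who"])
    return state
```

Because the events are deduplicated and sorted by a total order before folding, two mirrors holding the same set of events would compute the same state under these assumptions.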

Claims

C1 · strongest claim

Across tasks, the best results are typically obtained when stochastic tokenization is applied during both pretraining and fine-tuning, whereas applying BPE dropout only during fine-tuning can underperform deterministic tokenization in smaller-data settings.

C2 · weakest assumption

That the downsampled subsets of high-resource languages and the chosen evaluation tasks sufficiently represent truly low-resource scenarios, and that the modest morphological alignment gains explain the performance benefits.

C3 · one-line summary

Stochastic tokenization with BPE dropout during both pretraining and fine-tuning outperforms deterministic tokenization or fine-tuning-only dropout on low-resource NLP tasks.

Receipt and verification

First computed	2026-05-18T02:44:47.101101Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`)
Schema	pith-number/v1.0

Canonical hash

2442c92855397a49d6fa1a6dd69dd01ed208fffb1703e0dd7cb1e5486fb175bb

Aliases

arxiv: 2605.13436 · arxiv_version: 2605.13436v1 · doi: 10.48550/arxiv.2605.13436 · pith_short_12: ERBMSKCVHF5E · pith_short_16: ERBMSKCVHF5ETVX2 · pith_short_8: ERBMSKCV
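
The short aliases listed above are prefixes of the full 26-character ID. The sketch below reproduces them by truncation. For this record, the ID also equals the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash; whether that is the general derivation rule is an assumption and is not stated on this page. The function names are illustrative only.

```python
import base64

CANONICAL_HASH = "2442c92855397a49d6fa1a6dd69dd01ed208fffb1703e0dd7cb1e5486fb175bb"
PITH_ID = "ERBMSKCVHF5ETVX2DJW5NHOQD3"


def short_aliases(pith_id, lengths=(8, 12, 16)):
    """Build the pith_short_N aliases as prefixes of the full ID."""
    return {f"pith_short_{n}": pith_id[:n] for n in lengths}


def base32_prefix(hex_digest, length=26):
    """First `length` characters of the Base32 encoding of a hex digest."""
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


aliases = short_aliases(PITH_ID)
# Observed for this one record only; not a documented rule.
matches_hash = base32_prefix(CANONICAL_HASH) == PITH_ID

print(aliases)
print(matches_hash)
```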

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/ERBMSKCVHF5ETVX2DJW5NHOQD3 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 2442c92855397a49d6fa1a6dd69dd01ed208fffb1703e0dd7cb1e5486fb175bb

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "95cf7dcc087fa1067e2a6e12016390653396469c8ef71be08c5541b0a847e0e6",
    "cross_cats_sorted": [
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-05-13T12:31:04Z",
    "title_canon_sha256": "fe3fc4f0fe3e5638c42b357229d5dddc318e3b413ba0ca2cab87bff14dadd35a"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13436",
    "kind": "arxiv",
    "version": 1
  }
}
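
The verification command earlier on this page serializes the canonical record with sorted keys and compact separators, then hashes the UTF-8 bytes with SHA-256. A minimal offline sketch of the same steps in Python follows, with the record above pasted in as a literal. It assumes the record printed on this page is exactly what the resolver returns in `canonical_record`; the digest it prints should then be compared against the canonical hash listed above. The helper name `canonical_sha256` is illustrative.

```python
import hashlib
import json

# Canonical record copied from this page (assumed identical to the resolver output).
record = {
    "metadata": {
        "abstract_canon_sha256": "95cf7dcc087fa1067e2a6e12016390653396469c8ef71be08c5541b0a847e0e6",
        "cross_cats_sorted": ["cs.LG"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-05-13T12:31:04Z",
        "title_canon_sha256": "fe3fc4f0fe3e5638c42b357229d5dddc318e3b413ba0ca2cab87bff14dadd35a",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13436", "kind": "arxiv", "version": 1},
}


def canonical_sha256(obj):
    """Hash the sorted-key, compact-separator JSON form of `obj`."""
    blob = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


digest = canonical_sha256(record)
print(digest)  # compare by eye with the canonical hash on this page
```

Sorting keys before hashing means the digest does not depend on the order in which a JSON parser happens to emit the fields, which is what lets independent mirrors agree on the hash.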