Pith Number

pith:X2C2FBWI

pith:2026:X2C2FBWI24KOPMNXTBNRNITJGC

Status: not attested · not anchored · not stored · refs resolved

How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability

Changdae Oh, Sharon Li, Shawn Im, Zhen Fang

Transformer weights emerge in closed form as compositions of three basis functions from corpus statistics.

arxiv:2601.19208 v2 · 2026-01-27 · cs.CL · cs.LG

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

Each set of weights of the transformer has a closed-form expression as a simple composition of three basis functions (bigram, token-interchangeability, and context mappings), reflecting the statistics of the text corpus.

C2 weakest assumption

The leading-term approximation of the gradients remains accurate enough in the earliest training phase to determine the functional form of the learned weights, and semantic associations are shaped primarily by these early-stage closed-form expressions rather than by later training dynamics.

C3 one-line summary

Transformer weights at early training stages are closed-form compositions of bigram, token-interchangeability, and context mappings that directly reflect text-corpus statistics and explain the emergence of semantic associations.

References

37 extracted · 37 resolved · 9 Pith anchors

[1] GPT-4 Technical Report · arXiv:2303.08774

[2] Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020

[3] Sparse Autoencoders Find Highly Interpretable Features in Language Models · arXiv:2309.08600

[4] Computational-statistical gaps in gaussian single-index models

[5] How two-layer neural networks learn, one (giant) step at a time · arXiv:2305.18270

Receipt and verification

First computed	2026-05-18T03:09:24.221232Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

be85a286c8d714e7b1b7985b16a26930b70ef4c92891b3689d3c34bcd0d2775a

Aliases

arxiv: 2601.19208 · arxiv_version: 2601.19208v2 · doi: 10.48550/arxiv.2601.19208 · pith_short_12: X2C2FBWI24KO · pith_short_16: X2C2FBWI24KOPMNX · pith_short_8: X2C2FBWI
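The `pith_short_*` aliases above are visibly prefixes of the full identifier, and for this record the full identifier also matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. The sketch below reproduces that relationship. It is an observation about this one record, not a documented rule of the `pith-number/v1.0` schema; the function name `pith_id_from_hash` and the 26-character length are illustrative assumptions.

```python
import base64

# Canonical hash as published on this record page.
CANONICAL_HASH = "be85a286c8d714e7b1b7985b16a26930b70ef4c92891b3689d3c34bcd0d2775a"


def pith_id_from_hash(canonical_hash_hex: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Hypothetical helper: the truncation length is inferred from this record,
    not taken from a published specification.
    """
    digest = bytes.fromhex(canonical_hash_hex)
    return base64.b32encode(digest).decode("ascii")[:length]


full_id = pith_id_from_hash(CANONICAL_HASH)

# The short aliases are plain prefixes of the full identifier.
aliases = {f"pith_short_{n}": full_id[:n] for n in (8, 12, 16)}

print(full_id)   # X2C2FBWI24KOPMNXTBNRNITJGC
print(aliases)
```

If the derivation holds for other records too, a mirror could recompute every alias from the canonical hash alone instead of storing them.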

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/X2C2FBWI24KOPMNXTBNRNITJGC \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: be85a286c8d714e7b1b7985b16a26930b70ef4c92891b3689d3c34bcd0d2775a

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "cae22820d821c975e2c4ed748ed7d91a0a6a6cce4f6ecf69c7801741d88f9c7c",
    "cross_cats_sorted": [
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-01-27T05:22:34Z",
    "title_canon_sha256": "a88eddb01807ed9da61f8c49f0b37a6fed22de9afae17680ac38450b01224444"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2601.19208",
    "kind": "arxiv",
    "version": 2
  }
}
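The shell recipe under "Verify this Pith Number yourself" canonicalizes the record (sorted keys, compact separators, UTF-8) before hashing it. The same steps can be sketched offline in Python against the record as printed above. Whether the printed JSON is byte-for-byte what the resolver returns in `.canonical_record` is an assumption, so the script reports the comparison rather than asserting it; the name `canonical_sha256` is illustrative.

```python
import hashlib
import json

# Hash published on this page; used only for comparison.
EXPECTED_HASH = "be85a286c8d714e7b1b7985b16a26930b70ef4c92891b3689d3c34bcd0d2775a"

# Record copied from the "Canonical record JSON" section (assumed complete).
record = {
    "metadata": {
        "abstract_canon_sha256": "cae22820d821c975e2c4ed748ed7d91a0a6a6cce4f6ecf69c7801741d88f9c7c",
        "cross_cats_sorted": ["cs.LG"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-01-27T05:22:34Z",
        "title_canon_sha256": "a88eddb01807ed9da61f8c49f0b37a6fed22de9afae17680ac38450b01224444",
    },
    "schema_version": "1.0",
    "source": {"id": "2601.19208", "kind": "arxiv", "version": 2},
}


def canonical_sha256(obj: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED_HASH)
```

Sorting keys makes the digest independent of the order in which a mirror happens to store the fields, which is what lets independent hosts recompute the same value.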