Pith Number

pith:RNLXT3JA

pith:2026:RNLXT3JAGTMBVMZZMRC3AOGBQP
not attested · not anchored · not stored · refs resolved

Unified Pix Token And Word Token Generative Language Model

Haun Leung, Zinan Wang

A new generative language model assigns each image pixel its own token to unify visual and textual inputs.

arxiv:2605.14028 v1 · 2026-05-13 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{RNLXT3JAGTMBVMZZMRC3AOGBQP}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

The new model unifies pix token and word token into the generative language model... The experimental results show that it has good performance even in small model and with limited training data.

C2 · weakest assumption

That assigning each pixel its own token embedding plus the added color folding and global conditional attention approximation will produce meaningfully better visual detail understanding than existing patch-based encoders, without any quantitative comparison or ablation shown in the provided text.

C3 · one-line summary

A new model unifies per-pixel and word tokens in a generative language model with per-pixel embeddings, color folding, and unsupervised image pretraining, reporting good performance on small models with limited data.

References

15 extracted · 15 resolved · 0 Pith anchors

[1] Neural Information Processing Systems.
[2] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. 2020.
[3] Learning Transferable Visual Models From Natural Language Supervision. 2021.
[4] Sigmoid Loss for Language Image Pre-Training. 2023.
[5] Visual Instruction Tuning. 2023.
Receipt and verification
First computed: 2026-05-17T23:39:12.875309Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

8b5779ed2034d81ab3396445b038c183e8e66da75bafbd54f17df70b94c91f3a

Aliases

arxiv: 2605.14028 · arxiv_version: 2605.14028v1 · doi: 10.48550/arxiv.2605.14028 · pith_short_12: RNLXT3JAGTMB · pith_short_16: RNLXT3JAGTMBVMZZ · pith_short_8: RNLXT3JA
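
The identifiers on this page appear to be related to the canonical hash in a simple way: the 26-character Pith Number matches the first 26 characters of the RFC 4648 Base32 encoding of the SHA-256 digest, and the pith_short_8, pith_short_12 and pith_short_16 aliases are prefixes of it. This is inferred from the values shown in this record only, not from a published specification, so treat the sketch below as an assumption. The function name pith_number_from_hash is hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "8b5779ed2034d81ab3396445b038c183e8e66da75bafbd54f17df70b94c91f3a"
PITH_NUMBER = "RNLXT3JAGTMBVMZZMRC3AOGBQP"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest and keep the leading characters.

    ASSUMPTION: the truncation length (26) and the use of the standard
    Base32 alphabet are read off this one record, not a documented rule.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


pith = pith_number_from_hash(CANONICAL_HASH)

# The short aliases listed above are plain prefixes of the full identifier.
aliases = {f"pith_short_{n}": pith[:n] for n in (8, 12, 16)}

print(pith == PITH_NUMBER)
print(aliases)
```

If the relationship holds for other records, a client could derive every alias from the canonical hash alone; that would need to be checked against more than one record before relying on it.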
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/RNLXT3JAGTMBVMZZMRC3AOGBQP \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 8b5779ed2034d81ab3396445b038c183e8e66da75bafbd54f17df70b94c91f3a
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "603b2c502b29c231ff054044abb6165cf7addcfe829fdc9742523fdcb110dc9e",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-13T18:38:51Z",
    "title_canon_sha256": "a07c99d9289337019c324237fce5d3d64bca7473b659b40301115ca0128b9263"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.14028",
    "kind": "arxiv",
    "version": 1
  }
}
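
The shell one-liner under "Verify this Pith Number yourself" can be unpacked into a small offline check against the canonical record JSON shown above. The sketch below reuses the same serialization settings as that one-liner (sorted keys, compact separators, ensure_ascii=False, UTF-8 encoding, SHA-256). The function name canonical_sha256 is hypothetical, and whether the digest equals the published hash depends on the record above being byte-for-byte what the service signs, so the script prints the comparison rather than asserting it.

```python
import hashlib
import json

# Canonical record copied from this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "603b2c502b29c231ff054044abb6165cf7addcfe829fdc9742523fdcb110dc9e",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2026-05-13T18:38:51Z",
        "title_canon_sha256": "a07c99d9289337019c324237fce5d3d64bca7473b659b40301115ca0128b9263",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.14028", "kind": "arxiv", "version": 1},
}

# Hash published on this page.
EXPECTED = "8b5779ed2034d81ab3396445b038c183e8e66da75bafbd54f17df70b94c91f3a"


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and no extra whitespace, then hash."""
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


digest = canonical_sha256(RECORD)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the result does not depend on the order in which a mirror happens to store the fields, which is what makes the check reproducible across hosts.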