Pith Number

pith:J2SUWQ2V

pith:2026:J2SUWQ2VUT6GSTP7Z7NLPYX6G2
not attested · not anchored · not stored · refs resolved

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

Dong Chen, Fangyun Wei, Gao Huang, Jiayi Guo, Ji Li, Jinjing Zhao, Lei Shi, Li Chen, Tianyu He, Yang Yue, Yue Dong, Zanlin Ni, Zeyu Liu

InsightTok uses localized content-aware perceptual losses to improve text and face fidelity in discrete image tokenizers.

arxiv:2605.14333 v1 · 2026-05-14 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{J2SUWQ2VUT6GSTP7Z7NLPYX6G2}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

With a compact 16k codebook and a 16x downsampling rate, InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising general reconstruction quality. These gains consistently transfer to autoregressive image generation in InsightAR, producing images with clearer text and more faithful facial details.

C2 weakest assumption

That localized content-aware perceptual losses will reliably capture fine-grained text legibility and facial fidelity across diverse images without introducing new artifacts or requiring extensive hyperparameter tuning for each domain.

C3 one line summary

InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.

References

60 extracted · 60 resolved · 14 Pith anchors

[1] Adam: A Method for Stochastic Optimization 2014 · arXiv:1412.6980
[2] Cosmos World Foundation Model Platform for Physical AI 2025 · arXiv:2501.03575
[3] Flextok: Resampling images into 1d token sequences of flexible length 2025
[4] Scene text recognition with permuted autoregressive sequence models 2022
[5] Faces and text attract gaze independent of the task: Experimental data and computer model. Journal of Vision, 9(12):10 2009

Formal links

2 machine-checked theorem links

Receipt and verification
First computed 2026-05-17T23:39:08.268499Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

4ea54b4355a4fc694dffcfdab7e2fe36ade01ef7c164645fae48d7b0fb70a20f

Aliases

arxiv: 2605.14333 · arxiv_version: 2605.14333v1 · doi: 10.48550/arxiv.2605.14333 · pith_short_12: J2SUWQ2VUT6G · pith_short_16: J2SUWQ2VUT6GSTP7 · pith_short_8: J2SUWQ2V
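
The short aliases above are plain prefixes of the full 26-character Pith Number. For this record, the full number also matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical SHA-256 hash. The sketch below checks both relationships using only values shown on this page. That the same derivation holds for every record is an assumption, not something this page states, and the helper name `pith_from_hash` is hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "4ea54b4355a4fc694dffcfdab7e2fe36ade01ef7c164645fae48d7b0fb70a20f"
PITH_NUMBER = "J2SUWQ2VUT6GSTP7Z7NLPYX6G2"
ALIASES = {8: "J2SUWQ2V", 12: "J2SUWQ2VUT6G", 16: "J2SUWQ2VUT6GSTP7"}


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Hypothetical helper: Base32-encode the digest bytes and keep a prefix."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


derived = pith_from_hash(CANONICAL_HASH)
print(derived)  # J2SUWQ2VUT6GSTP7Z7NLPYX6G2
print(derived == PITH_NUMBER)  # True

# Each short alias should be a prefix of the full number.
for n, alias in ALIASES.items():
    print(n, PITH_NUMBER[:n] == alias)
```

A shorter alias is easier to cite but identifies the record less uniquely, so the full 26-character form is the safer key when matching records programmatically.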
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/J2SUWQ2VUT6GSTP7Z7NLPYX6G2 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 4ea54b4355a4fc694dffcfdab7e2fe36ade01ef7c164645fae48d7b0fb70a20f
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "4ef55016f8f82c403eac1edf7860c2366b261416c6e96612c37953b8096c777d",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-14T03:57:25Z",
    "title_canon_sha256": "bc892d40b41de6034c550b6e8f46c1fc03a36e85ee9702295e73845da32ae016"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.14333",
    "kind": "arxiv",
    "version": 1
  }
}
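
The shell pipeline above can also be written as a small Python script. The sketch below applies the same canonicalization as the one-liner (sorted keys, compact separators, UTF-8 without ASCII escaping) to the canonical record as displayed on this page, then hashes it with SHA-256. It assumes the displayed JSON is identical to what the API returns in `.canonical_record`. The function name `canonical_sha256` is illustrative and not part of any Pith API.

```python
import hashlib
import json

# Expected hash, copied from the "Canonical hash" section of this page.
EXPECTED = "4ea54b4355a4fc694dffcfdab7e2fe36ade01ef7c164645fae48d7b0fb70a20f"

# Canonical record as displayed on this page (assumed to match the API response).
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "4ef55016f8f82c403eac1edf7860c2366b261416c6e96612c37953b8096c777d",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2026-05-14T03:57:25Z",
    "title_canon_sha256": "bc892d40b41de6034c550b6e8f46c1fc03a36e85ee9702295e73845da32ae016"
  },
  "schema_version": "1.0",
  "source": {"id": "2605.14333", "kind": "arxiv", "version": 1}
}
"""


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and no whitespace, then hash the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
print("match" if digest == EXPECTED else "mismatch")
```

Because the keys are sorted before hashing, the digest does not depend on the order in which a mirror stores or emits the fields. If the script prints `mismatch`, fetch the live record with the `curl` command rather than relying on the copy embedded here.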