Pith Number

pith:FMR6TAKP

pith:2021:FMR6TAKPY4GK7B2J6PENHAJL7N
not attested · not anchored · not stored · refs resolved

Vector-quantized Image Modeling with Improved VQGAN

Alexander Ku, Han Zhang, James Qin, Jason Baldridge, Jiahui Yu, Jing Yu Koh, Ruoming Pang, Xin Li, Yonghui Wu, Yuanzhong Xu

An improved ViT-VQGAN produces discrete image tokens that let an autoregressive Transformer reach an Inception Score of 175.1 and FID of 4.17 on ImageNet.

arxiv:2110.04627 v3 · 2021-10-09 · cs.CV · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{FMR6TAKPY4GK7B2J6PENHAJL7N}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

When trained on ImageNet at 256×256 resolution, we achieve Inception Score (IS) of 175.1 and Fréchet Inception Distance (FID) of 4.17, a dramatic improvement over the vanilla VQGAN, which obtains 70.6 and 17.04 for IS and FID, respectively.

C2 · weakest assumption

That the discrete tokens produced by the improved ViT-VQGAN retain enough visual information for autoregressive modeling to succeed at both high-quality generation and strong unsupervised representations without critical loss of detail or mode collapse.

C3 · one-line summary

Improved ViT-VQGAN enables autoregressive Transformer pretraining on ImageNet tokens to reach IS 175.1 and FID 4.17 for generation plus 73.2% linear-probe accuracy, beating prior iGPT models.

References

75 extracted · 75 resolved · 14 Pith anchors

[1] Schwing, Jan Kautz, and Arash Vahdat 2010
[2] Learning representations by maximizing mutual information across views 2019 · arXiv:1906.00910
[3] Towards causal benchmarking of bias in face analysis algorithms 2020
[4] BEiT: BERT Pre-Training of Image Transformers 2021 · arXiv:2106.08254
[5] Large scale GAN training for high fidelity natural image synthesis 2019

Formal links

2 machine-checked theorem links

Cited by

26 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:46.953044Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

2b23e9814fc70caf8749f3c8d3812bfb427c20561b0eb78e59e0f5adcfb00686

Aliases

arxiv: 2110.04627 · arxiv_version: 2110.04627v3 · doi: 10.48550/arxiv.2110.04627 · pith_short_12: FMR6TAKPY4GK · pith_short_16: FMR6TAKPY4GK7B2J · pith_short_8: FMR6TAKP
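
The short aliases above are plain prefixes of the full Pith Number. Comparing the values on this page also suggests, though the page does not state it, that the Pith Number is the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. The Python sketch below checks both relationships for this record only. The function name `pith_number_from_hash` is hypothetical, and the derivation rule is an inference from this single record, not documented Pith behavior.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "2b23e9814fc70caf8749f3c8d3812bfb427c20561b0eb78e59e0f5adcfb00686"
PITH_NUMBER = "FMR6TAKPY4GK7B2J6PENHAJL7N"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Assumed rule (inferred, not documented): Base32-encode the raw
    SHA-256 digest and keep the first `length` characters."""
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


derived = pith_number_from_hash(CANONICAL_HASH)
print("derived from hash matches:", derived == PITH_NUMBER)

# The pith_short_* aliases listed above, keyed by their length.
aliases = {8: "FMR6TAKP", 12: "FMR6TAKPY4GK", 16: "FMR6TAKPY4GK7B2J"}
for n, alias in aliases.items():
    print(f"pith_short_{n} is a prefix:", alias == PITH_NUMBER[:n])
```

If the inferred rule holds in general, a short alias can be checked against a canonical hash without any network access. Treat that as an assumption until confirmed against the Pith schema.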
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/FMR6TAKPY4GK7B2J6PENHAJL7N \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 2b23e9814fc70caf8749f3c8d3812bfb427c20561b0eb78e59e0f5adcfb00686
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "1281174dc868bc9ba29a7618a8026ba8d4e9294520a78257df36a7f3417a8909",
    "cross_cats_sorted": [
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2021-10-09T18:36:00Z",
    "title_canon_sha256": "a7b3bfcc264af0f8477ff69f4872107cbe3567a6ffb64ac00190bf727d3d480c"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2110.04627",
    "kind": "arxiv",
    "version": 3
  }
}
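
The Python one-liner in the verification command can be unpacked into a small function. The sketch below mirrors the same canonicalization steps: sorted keys, compact separators, no ASCII escaping, UTF-8 encoding, then SHA-256. The name `canonical_sha256` is illustrative. Whether the record as displayed here reproduces the canonical hash byte for byte has not been checked, so compare the printed digest with the value under Canonical hash yourself.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize `record` the way the verification one-liner does and hash it."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# The canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "1281174dc868bc9ba29a7618a8026ba8d4e9294520a78257df36a7f3417a8909",
        "cross_cats_sorted": ["cs.LG"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2021-10-09T18:36:00Z",
        "title_canon_sha256": "a7b3bfcc264af0f8477ff69f4872107cbe3567a6ffb64ac00190bf727d3d480c",
    },
    "schema_version": "1.0",
    "source": {"id": "2110.04627", "kind": "arxiv", "version": 3},
}

digest = canonical_sha256(record)
print(digest)
```

Because keys are sorted before hashing, two mirrors that hold the same record with different key order should compute the same digest.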