Pith Number

pith:VXBTKPXP

pith:2021:VXBTKPXPWYONVRBHNNR45ZQN47
not attested · not anchored · not stored · refs resolved

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Baining Guo, Han Hu, Stephen Lin, Yixuan Wei, Yue Cao, Yutong Lin, Ze Liu, Zheng Zhang

Swin Transformer computes self-attention within shifted local windows in a hierarchical structure, giving vision Transformer backbones computational complexity that is linear in image size.

arxiv:2103.14030 v2 · 2021-03-25 · cs.CV · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{VXBTKPXPWYONVRBHNNR45ZQN47}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.

C2 weakest assumption

The fixed window size and shift pattern chosen for ImageNet are assumed to transfer without major retuning to detection on COCO and segmentation on ADE20K. The paper reports strong numbers but does not isolate how much of the gain comes from the backbone and how much from the detection and segmentation heads.

C3 one line summary

Swin Transformer reaches 87.3% top-1 accuracy on ImageNet-1K and sets new records on COCO detection and ADE20K segmentation by replacing global self-attention with shifted-window local attention inside a hierarchical feature pyramid.
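
To make the shifted-window idea in the summary concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names (`window_partition`, `cyclic_shift`, `attention_pairs`) and the toy 8x8 grid with 4x4 windows are assumptions chosen for illustration, and the attention masking that the paper applies after the cyclic shift is left out. The paper's default window is 7x7 with a shift of half the window size, rounded down.

```python
def window_partition(grid, window_size):
    """Split a 2-D grid (list of rows) into non-overlapping square windows.

    Returns the windows in row-major order; each window is a flat list of
    the window_size * window_size entries it covers.
    """
    height, width = len(grid), len(grid[0])
    if height % window_size or width % window_size:
        raise ValueError("grid sides must be multiples of window_size")
    windows = []
    for top in range(0, height, window_size):
        for left in range(0, width, window_size):
            windows.append([
                grid[r][c]
                for r in range(top, top + window_size)
                for c in range(left, left + window_size)
            ])
    return windows


def cyclic_shift(grid, shift):
    """Roll the grid up and to the left by `shift` positions, wrapping around."""
    height, width = len(grid), len(grid[0])
    return [
        [grid[(r + shift) % height][(c + shift) % width] for c in range(width)]
        for r in range(height)
    ]


def attention_pairs(height, width, window_size=None):
    """Count query-key pairs: global if window_size is None, else per-window."""
    tokens = height * width
    if window_size is None:
        return tokens * tokens
    return tokens * window_size * window_size


# Toy example (illustrative sizes): 8x8 grid of token ids, 4x4 windows.
grid = [[r * 8 + c for c in range(8)] for r in range(8)]
regular = window_partition(grid, 4)
shifted = window_partition(cyclic_shift(grid, 4 // 2), 4)

print(len(regular), len(regular[0]))  # 4 16
print(regular[0][:5])                 # [0, 1, 2, 3, 8]
print(shifted[0][:5])                 # [18, 19, 20, 21, 26]
```

The first shifted window covers rows 2 to 5 and columns 2 to 5 of the original grid, so it straddles all four regular windows; alternating the two partitions is what lets information cross window boundaries. For a 56x56 token map, `attention_pairs(56, 56)` counts 9,834,496 query-key pairs while `attention_pairs(56, 56, 7)` counts 153,664, a factor of 64 fewer, and the windowed count grows linearly with the number of tokens.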

References

85 extracted · 85 resolved · 4 Pith anchors


Formal links

1 machine-checked theorem link

Cited by

27 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:50.426932Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

adc3353eefb61cdac4276b63cee60de7fd510f54088b109eb528218d0dc5d1e2

Aliases

arxiv: 2103.14030 · arxiv_version: 2103.14030v2 · doi: 10.48550/arxiv.2103.14030 · pith_short_12: VXBTKPXPWYON · pith_short_16: VXBTKPXPWYONVRBH · pith_short_8: VXBTKPXP
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/VXBTKPXPWYONVRBHNNR45ZQN47 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: adc3353eefb61cdac4276b63cee60de7fd510f54088b109eb528218d0dc5d1e2
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "3863d01106f9d601de72bd6cc5b11369b05e213b307e2cfd244a112f89d7f2ee",
    "cross_cats_sorted": [
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2021-03-25T17:59:31Z",
    "title_canon_sha256": "429e2f6262f5bcd328a152a58f6dfa33854bfdd711788f8099511a4dd20e776f"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2103.14030",
    "kind": "arxiv",
    "version": 2
  }
}
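
The Python step of the shell pipeline above can be unpacked into a small function for readers who want to see what is being hashed. This is a sketch that mirrors the one-liner (sorted keys, compact separators, no ASCII escaping, SHA-256 over the encoded bytes); the function name `canonical_sha256` and the two partial sample records are assumptions for illustration, so the digest printed here is not expected to equal the canonical hash of the full record.

```python
import hashlib
import json


def canonical_sha256(record):
    """Hash a JSON-compatible object using the same settings as the one-liner."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # key order in the input must not matter
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters unescaped
    ).encode()                   # str.encode() defaults to UTF-8
    return hashlib.sha256(canonical).hexdigest()


# Two partial records (illustrative only) that differ in key order and spacing.
a = json.loads(
    '{"source": {"id": "2103.14030", "kind": "arxiv", "version": 2},'
    ' "schema_version": "1.0"}'
)
b = json.loads(
    '{"schema_version":"1.0","source":{"version":2,"kind":"arxiv","id":"2103.14030"}}'
)

digest_a = canonical_sha256(a)
digest_b = canonical_sha256(b)
print(digest_a == digest_b)  # True
```

Because the serialization sorts keys and fixes the separators, any mirror that parses the same record should arrive at the same bytes, and therefore the same digest, regardless of how the JSON was formatted in transit.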