{"state_type":"pith_open_graph_state","state_version":"1.0","pith_number":"pith:2021:VXBTKPXPWYONVRBHNNR45ZQN47","merge_version":"pith-open-graph-merge-v1","event_count":2,"valid_event_count":2,"invalid_event_count":0,"equivocation_count":0,"current":{"canonical_record":{"metadata":{"abstract_canon_sha256":"3863d01106f9d601de72bd6cc5b11369b05e213b307e2cfd244a112f89d7f2ee","cross_cats_sorted":["cs.LG"],"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2021-03-25T17:59:31Z","title_canon_sha256":"429e2f6262f5bcd328a152a58f6dfa33854bfdd711788f8099511a4dd20e776f"},"schema_version":"1.0","source":{"id":"2103.14030","kind":"arxiv","version":2}},"source_aliases":[{"alias_kind":"arxiv","alias_value":"2103.14030","created_at":"2026-05-17T23:38:50Z"},{"alias_kind":"arxiv_version","alias_value":"2103.14030v2","created_at":"2026-05-17T23:38:50Z"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2103.14030","created_at":"2026-05-17T23:38:50Z"},{"alias_kind":"pith_short_12","alias_value":"VXBTKPXPWYON","created_at":"2026-05-18T12:33:33Z"},{"alias_kind":"pith_short_16","alias_value":"VXBTKPXPWYONVRBH","created_at":"2026-05-18T12:33:33Z"},{"alias_kind":"pith_short_8","alias_value":"VXBTKPXP","created_at":"2026-05-18T12:33:33Z"}],"graph_snapshots":[{"event_id":"sha256:52fe77d42ce2f6431641f908bcaa1b85c96fba04f69d18416ee6b814df00d329","target":"graph","created_at":"2026-05-17T23:38:50Z","signer":{"key_id":"pith-v1-2026-05","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","signer_id":"pith.science","signer_type":"pith_registry"},"payload":{"graph_snapshot":{"author_claims":{"count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","strong_count":0},"builder_version":"pith-number-builder-2026-05-17-v1","claims":{"count":4,"items":[{"attestation":"unclaimed","claim_id":"C1","kind":"strongest_claim","source":"verdict.strongest_claim","status":"machine_extracted","text":"Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones."},{"attestation":"unclaimed","claim_id":"C2","kind":"weakest_assumption","source":"verdict.weakest_assumption","status":"machine_extracted","text":"The assumption that the fixed window size and shift pattern chosen for ImageNet will transfer without major retuning to detection and segmentation heads on COCO and ADE20K; the paper reports strong numbers but does not isolate how much of the gain comes from the backbone versus from the detection/segmentation heads."},{"attestation":"unclaimed","claim_id":"C3","kind":"one_line_summary","source":"verdict.one_line_summary","status":"machine_extracted","text":"Swin Transformer reaches 87.3% ImageNet accuracy and sets new records on COCO detection and ADE20K segmentation by replacing global self-attention with shifted-window local attention inside a hierarchical pyramid."},{"attestation":"unclaimed","claim_id":"C4","kind":"headline","source":"verdict.pith_extraction.headline","status":"machine_extracted","text":"Swin Transformer uses shifted windows in a hierarchical structure to make vision Transformers efficient backbones with linear complexity."}],"snapshot_sha256":"8ce4ba11974a009e6b46f3e966e8bdb2e0fdac92b4d6925b9d4e450df60b6522"},"formal_canon":{"evidence_count":1,"snapshot_sha256":"34c1e1fb0ae7e38ed8115d8c5ced3b687f4ceac98cb16440756bca1c76e464af"},"paper":{"abstract_excerpt":"This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with \\textbf{S}hifted \\textbf{win}dows. The shifted windowing scheme brings greater efficiency by limiting self-attention ","authors_text":"Baining Guo, Han Hu, Stephen Lin, Yixuan Wei, Yue Cao, Yutong Lin, Ze Liu, Zheng Zhang","cross_cats":["cs.LG"],"headline":"Swin Transformer uses shifted windows in a hierarchical structure to make vision Transformers efficient backbones with linear complexity.","license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2021-03-25T17:59:31Z","title":"Swin Transformer: Hierarchical Vision Transformer using Shifted Windows"},"references":{"count":85,"internal_anchors":4,"resolved_work":85,"sample":[{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":1,"title":"Unilmv2: Pseudo-masked language models for uniﬁed language model pre-training","work_id":"0f82b4e9-23cd-423f-9179-7358c807d849","year":2020},{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":2,"title":"Toward transformer-based object detection","work_id":"962eb65e-1b81-4d5e-a137-4f0d69e44485","year":2012},{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":3,"title":"Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V . Le. Attention augmented convolutional net- works, 2020. 3","work_id":"7db2c0f3-731f-4366-8c07-f67ec978ce38","year":2020},{"cited_arxiv_id":"2004.10934","doi":"","is_internal_anchor":true,"ref_index":4,"title":"YOLOv4: Optimal Speed and Accuracy of Object Detection","work_id":"7057aaee-27f6-4209-a83c-f59727f937a8","year":2004},{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":5,"title":"Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. Soft-nms – improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision (","work_id":"0eae2b0c-7c37-4b17-a762-9010b513baec","year":2017}],"snapshot_sha256":"68692fa97329b726adc73a03eac3b3f9717cb7014b29dc88d80ab5395e18c809"},"source":{"id":"2103.14030","kind":"arxiv","version":2},"verdict":{"created_at":"2026-05-15T19:24:34.375534Z","id":"8b340a03-6067-42ea-ba67-24dbf4b5c28d","model_set":{"reader":"grok-4.3"},"one_line_summary":"Swin Transformer reaches 87.3% ImageNet accuracy and sets new records on COCO detection and ADE20K segmentation by replacing global self-attention with shifted-window local attention inside a hierarchical pyramid.","pipeline_version":"pith-pipeline@v0.9.0","pith_extraction_headline":"Swin Transformer uses shifted windows in a hierarchical structure to make vision Transformers efficient backbones with linear complexity.","strongest_claim":"Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.","weakest_assumption":"The assumption that the fixed window size and shift pattern chosen for ImageNet will transfer without major retuning to detection and segmentation heads on COCO and ADE20K; the paper reports strong numbers but does not isolate how much of the gain comes from the backbone versus from the detection/segmentation heads."}},"verdict_id":"8b340a03-6067-42ea-ba67-24dbf4b5c28d"}}],"author_attestations":[],"timestamp_anchors":[],"storage_attestations":[],"citation_signatures":[],"replication_records":[],"corrections":[],"mirror_hints":[],"record_created":{"event_id":"sha256:23cc7356dcfe4f698a03bdb820c64b3ff914722a05be972acfcaea07b9217429","target":"record","created_at":"2026-05-17T23:38:50Z","signer":{"key_id":"pith-v1-2026-05","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","signer_id":"pith.science","signer_type":"pith_registry"},"payload":{"attestation_state":"computed","canonical_record":{"metadata":{"abstract_canon_sha256":"3863d01106f9d601de72bd6cc5b11369b05e213b307e2cfd244a112f89d7f2ee","cross_cats_sorted":["cs.LG"],"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2021-03-25T17:59:31Z","title_canon_sha256":"429e2f6262f5bcd328a152a58f6dfa33854bfdd711788f8099511a4dd20e776f"},"schema_version":"1.0","source":{"id":"2103.14030","kind":"arxiv","version":2}},"canonical_sha256":"adc3353eefb61cdac4276b63cee60de7fd510f54088b109eb528218d0dc5d1e2","receipt":{"algorithm":"ed25519","builder_version":"pith-number-builder-2026-05-17-v1","canonical_sha256":"adc3353eefb61cdac4276b63cee60de7fd510f54088b109eb528218d0dc5d1e2","first_computed_at":"2026-05-17T23:38:50.426932Z","key_id":"pith-v1-2026-05","kind":"pith_receipt","last_reissued_at":"2026-05-17T23:38:50.426932Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","receipt_version":"0.3","signature_b64":"JB31cB4964QSoE78A5dSJNCIKX0tWtR5pwdKHaX1GfMlEwm1xYvP7oULeZvO74pKeMJUbmajLCMk4ckHekxRDg==","signature_status":"signed_v1","signed_at":"2026-05-17T23:38:50.427376Z","signed_message":"canonical_sha256_bytes"},"source_id":"2103.14030","source_kind":"arxiv","source_version":2}}},"equivocations":[],"invalid_events":[],"applied_event_ids":["sha256:23cc7356dcfe4f698a03bdb820c64b3ff914722a05be972acfcaea07b9217429","sha256:52fe77d42ce2f6431641f908bcaa1b85c96fba04f69d18416ee6b814df00d329"],"state_sha256":"245c9cfaedad1b7995a499e60deae0df472c0b5c006b9d0876218e23a634ae52"}