{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2021:VXBTKPXPWYONVRBHNNR45ZQN47","short_pith_number":"pith:VXBTKPXP","schema_version":"1.0","canonical_sha256":"adc3353eefb61cdac4276b63cee60de7fd510f54088b109eb528218d0dc5d1e2","source":{"kind":"arxiv","id":"2103.14030","version":2},"attestation_state":"computed","paper":{"title":"Swin Transformer: Hierarchical Vision Transformer using Shifted Windows","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Swin Transformer uses shifted windows in a hierarchical structure to make vision Transformers efficient backbones with linear complexity.","cross_cats":["cs.LG"],"primary_cat":"cs.CV","authors_text":"Baining Guo, Han Hu, Stephen Lin, Yixuan Wei, Yue Cao, Yutong Lin, Ze Liu, Zheng Zhang","submitted_at":"2021-03-25T17:59:31Z","abstract_excerpt":"This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with \\textbf{S}hifted \\textbf{win}dows. The shifted windowing scheme brings greater efficiency by limiting self-attention "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2103.14030","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2021-03-25T17:59:31Z","cross_cats_sorted":["cs.LG"],"title_canon_sha256":"429e2f6262f5bcd328a152a58f6dfa33854bfdd711788f8099511a4dd20e776f","abstract_canon_sha256":"3863d01106f9d601de72bd6cc5b11369b05e213b307e2cfd244a112f89d7f2ee"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.427376Z","signature_b64":"JB31cB4964QSoE78A5dSJNCIKX0tWtR5pwdKHaX1GfMlEwm1xYvP7oULeZvO74pKeMJUbmajLCMk4ckHekxRDg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"adc3353eefb61cdac4276b63cee60de7fd510f54088b109eb528218d0dc5d1e2","last_reissued_at":"2026-05-17T23:38:50.426932Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.426932Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Swin Transformer: Hierarchical Vision Transformer using Shifted Windows","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Swin Transformer uses shifted windows in a hierarchical structure to make vision Transformers efficient backbones with linear complexity.","cross_cats":["cs.LG"],"primary_cat":"cs.CV","authors_text":"Baining Guo, Han Hu, Stephen Lin, Yixuan Wei, Yue Cao, Yutong Lin, Ze Liu, Zheng Zhang","submitted_at":"2021-03-25T17:59:31Z","abstract_excerpt":"This paper presents a new vision Transformer, called Swin Transformer, that capably serves as a general-purpose backbone for computer vision. Challenges in adapting Transformer from language to vision arise from differences between the two domains, such as large variations in the scale of visual entities and the high resolution of pixels in images compared to words in text. To address these differences, we propose a hierarchical Transformer whose representation is computed with \\textbf{S}hifted \\textbf{win}dows. The shifted windowing scheme brings greater efficiency by limiting self-attention "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that the fixed window size and shift pattern chosen for ImageNet will transfer without major retuning to detection and segmentation heads on COCO and ADE20K; the paper reports strong numbers but does not isolate how much of the gain comes from the backbone versus from the detection/segmentation heads.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Swin Transformer reaches 87.3% ImageNet accuracy and sets new records on COCO detection and ADE20K segmentation by replacing global self-attention with shifted-window local attention inside a hierarchical pyramid.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Swin Transformer uses shifted windows in a hierarchical structure to make vision Transformers efficient backbones with linear complexity.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"8ce4ba11974a009e6b46f3e966e8bdb2e0fdac92b4d6925b9d4e450df60b6522"},"source":{"id":"2103.14030","kind":"arxiv","version":2},"verdict":{"id":"8b340a03-6067-42ea-ba67-24dbf4b5c28d","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T19:24:34.375534Z","strongest_claim":"Its performance surpasses the previous state-of-the-art by a large margin of +2.7 box AP and +2.6 mask AP on COCO, and +3.2 mIoU on ADE20K, demonstrating the potential of Transformer-based models as vision backbones.","one_line_summary":"Swin Transformer reaches 87.3% ImageNet accuracy and sets new records on COCO detection and ADE20K segmentation by replacing global self-attention with shifted-window local attention inside a hierarchical pyramid.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that the fixed window size and shift pattern chosen for ImageNet will transfer without major retuning to detection and segmentation heads on COCO and ADE20K; the paper reports strong numbers but does not isolate how much of the gain comes from the backbone versus from the detection/segmentation heads.","pith_extraction_headline":"Swin Transformer uses shifted windows in a hierarchical structure to make vision Transformers efficient backbones with linear complexity."},"references":{"count":85,"sample":[{"doi":"","year":2020,"title":"Unilmv2: Pseudo-masked language models for uniﬁed language model pre-training","work_id":"0f82b4e9-23cd-423f-9179-7358c807d849","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2012,"title":"Toward transformer-based object detection","work_id":"962eb65e-1b81-4d5e-a137-4f0d69e44485","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V . Le. Attention augmented convolutional net- works, 2020. 3","work_id":"7db2c0f3-731f-4366-8c07-f67ec978ce38","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2004,"title":"YOLOv4: Optimal Speed and Accuracy of Object Detection","work_id":"7057aaee-27f6-4209-a83c-f59727f937a8","ref_index":4,"cited_arxiv_id":"2004.10934","is_internal_anchor":true},{"doi":"","year":2017,"title":"Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S. Davis. Soft-nms – improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision (","work_id":"0eae2b0c-7c37-4b17-a762-9010b513baec","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":85,"snapshot_sha256":"68692fa97329b726adc73a03eac3b3f9717cb7014b29dc88d80ab5395e18c809","internal_anchors":4},"formal_canon":{"evidence_count":1,"snapshot_sha256":"34c1e1fb0ae7e38ed8115d8c5ced3b687f4ceac98cb16440756bca1c76e464af"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2103.14030","created_at":"2026-05-17T23:38:50.426994+00:00"},{"alias_kind":"arxiv_version","alias_value":"2103.14030v2","created_at":"2026-05-17T23:38:50.426994+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2103.14030","created_at":"2026-05-17T23:38:50.426994+00:00"},{"alias_kind":"pith_short_12","alias_value":"VXBTKPXPWYON","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"VXBTKPXPWYONVRBH","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"VXBTKPXP","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":27,"internal_anchor_count":27,"sample":[{"citing_arxiv_id":"2406.17323","citing_title":"XAMI -- A Benchmark Dataset for Artefact Detection in XMM-Newton Optical Images","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14255","citing_title":"Architecture-Aware Explanation Auditing for Industrial Visual Inspection","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2110.02178","citing_title":"MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2506.02587","citing_title":"BEVCALIB: LiDAR-Camera Calibration via Geometry-Guided Bird's-Eye View Representations","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2512.17111","citing_title":"Digitizing Nepal's Written Heritage: A Comprehensive HTR Pipeline for Old Nepali Manuscripts","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2601.17636","citing_title":"HealDA: Highlighting the importance of initial errors in end-to-end AI weather forecasts","ref_index":61,"is_internal_anchor":true},{"citing_arxiv_id":"2308.13418","citing_title":"Nougat: Neural Optical Understanding for Academic Documents","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2602.23024","citing_title":"InCoM: Intent-Driven Perception and Structured Coordination for Mobile Manipulation","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2312.17090","citing_title":"Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels","ref_index":142,"is_internal_anchor":true},{"citing_arxiv_id":"2603.13941","citing_title":"Bidirectional Cross-Attention Fusion of High-Res RGB and Low-Res HSI for Multimodal Automated Waste Sorting","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14255","citing_title":"Architecture-Aware Explanation Auditing for Industrial Visual Inspection","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13343","citing_title":"Hierarchical Transformer Preconditioning for Interactive Physics Simulation","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12855","citing_title":"Prediction of Rectal Cancer Regrowth from Longitudinal Endoscopy","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13343","citing_title":"Hierarchical Transformer Preconditioning for Interactive Physics Simulation","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2111.07832","citing_title":"iBOT: Image BERT Pre-Training with Online Tokenizer","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04086","citing_title":"LAA-X: Unified Localized Artifact Attention for Quality-Agnostic and Generalizable Face Forgery Detection","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2106.08254","citing_title":"BEiT: BERT Pre-Training of Image Transformers","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2107.08430","citing_title":"YOLOX: Exceeding YOLO Series in 2021","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12026","citing_title":"Spectral Vision Transformer for Efficient Tokenization with Limited Data","ref_index":57,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26869","citing_title":"KAYRA: A Microservice Architecture for AI-Assisted Karyotyping with Cloud and On-Premise Deployment","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24347","citing_title":"Semantic Segmentation for Histopathology using Learned Regularization based on Global Proportions","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2303.05499","citing_title":"Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2604.09034","citing_title":"The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2604.16987","citing_title":"DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17472","citing_title":"UniMesh: Unifying 3D Mesh Understanding and Generation","ref_index":26,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/VXBTKPXPWYONVRBHNNR45ZQN47","json":"https://pith.science/pith/VXBTKPXPWYONVRBHNNR45ZQN47.json","graph_json":"https://pith.science/api/pith-number/VXBTKPXPWYONVRBHNNR45ZQN47/graph.json","events_json":"https://pith.science/api/pith-number/VXBTKPXPWYONVRBHNNR45ZQN47/events.json","paper":"https://pith.science/paper/VXBTKPXP"},"agent_actions":{"view_html":"https://pith.science/pith/VXBTKPXPWYONVRBHNNR45ZQN47","download_json":"https://pith.science/pith/VXBTKPXPWYONVRBHNNR45ZQN47.json","view_paper":"https://pith.science/paper/VXBTKPXP","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2103.14030&json=true","fetch_graph":"https://pith.science/api/pith-number/VXBTKPXPWYONVRBHNNR45ZQN47/graph.json","fetch_events":"https://pith.science/api/pith-number/VXBTKPXPWYONVRBHNNR45ZQN47/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/VXBTKPXPWYONVRBHNNR45ZQN47/action/timestamp_anchor","attest_storage":"https://pith.science/pith/VXBTKPXPWYONVRBHNNR45ZQN47/action/storage_attestation","attest_author":"https://pith.science/pith/VXBTKPXPWYONVRBHNNR45ZQN47/action/author_attestation","sign_citation":"https://pith.science/pith/VXBTKPXPWYONVRBHNNR45ZQN47/action/citation_signature","submit_replication":"https://pith.science/pith/VXBTKPXPWYONVRBHNNR45ZQN47/action/replication_record"}},"created_at":"2026-05-17T23:38:50.426994+00:00","updated_at":"2026-05-17T23:38:50.426994+00:00"}