{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2021:5DCOF3VMGTZLMLR7G3WPGXB35H","short_pith_number":"pith:5DCOF3VM","schema_version":"1.0","canonical_sha256":"e8c4e2eeac34f2b62e3f36ecf35c3be9dba589951a95e029295327006e6a9849","source":{"kind":"arxiv","id":"2104.14294","version":2},"attestation_state":"computed","paper":{"title":"Emerging Properties in Self-Supervised Vision Transformers","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Self-supervised Vision Transformers encode explicit semantic segmentation information in their features.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Armand Joulin, Herv\\'e J\\'egou, Hugo Touvron, Ishan Misra, Julien Mairal, Mathilde Caron, Piotr Bojanowski","submitted_at":"2021-04-29T12:28:51Z","abstract_excerpt":"In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2104.14294","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2021-04-29T12:28:51Z","cross_cats_sorted":[],"title_canon_sha256":"6da5120c2f200ed38479654dc470ad04e74b90b6056c9f2193cbebf59211b2ef","abstract_canon_sha256":"e27e3a9f2795b58bc613cf7732e789f220f1f340ac0c88e34c09aebf18ca1c51"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:47.667059Z","signature_b64":"i9CHGKQKsjfwkf2Nw8c3vDaExRLbqn7OvH1Cie9JGimfOFZ4pLFjLxiMsZabiTq1oX4lYSyPX93D6FkkVbfUAA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"e8c4e2eeac34f2b62e3f36ecf35c3be9dba589951a95e029295327006e6a9849","last_reissued_at":"2026-05-17T23:38:47.666608Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:47.666608Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Emerging Properties in Self-Supervised Vision Transformers","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Self-supervised Vision Transformers encode explicit semantic segmentation information in their features.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Armand Joulin, Herv\\'e J\\'egou, Hugo Touvron, Ishan Misra, Julien Mairal, Mathilde Caron, Piotr Bojanowski","submitted_at":"2021-04-29T12:28:51Z","abstract_excerpt":"In this paper, we question if self-supervised learning provides new properties to Vision Transformer (ViT) that stand out compared to convolutional networks (convnets). Beyond the fact that adapting self-supervised methods to this architecture works particularly well, we make the following observations: first, self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. Second, these features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT. "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets [...] achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that the observed semantic segmentation information and k-NN performance arise specifically from the interaction of self-supervision with the ViT architecture rather than from particular hyperparameter choices, dataset statistics, or evaluation protocols.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Self-supervised Vision Transformers encode explicit semantic segmentation information in their features.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"e7744828491c16518e765229c1e6b7f357b45db74bab277450dec402b8d5e640"},"source":{"id":"2104.14294","kind":"arxiv","version":2},"verdict":{"id":"558d59f9-ecae-48a5-a6aa-d48188c806ed","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T13:59:46.222058Z","strongest_claim":"self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets [...] achieving 80.1% top-1 on ImageNet in linear evaluation with ViT-Base.","one_line_summary":"Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that the observed semantic segmentation information and k-NN performance arise specifically from the interaction of self-supervision with the ViT architecture rather than from particular hyperparameter choices, dataset statistics, or evaluation protocols.","pith_extraction_headline":"Self-supervised Vision Transformers encode explicit semantic segmentation information in their features."},"references":{"count":85,"sample":[{"doi":"","year":2018,"title":"arXiv preprint arXiv:1804.03235 , year=","work_id":"eaf8922f-efbd-41b1-b29e-d4589ef7f01d","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Self-labelling via simultaneous clustering and repre- sentation learning","work_id":"fcdc83a2-74ca-44ac-90cc-81fc3220fdba","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2006,"title":"preprint arXiv:2006.10803 , year=","work_id":"ff95292e-cd8e-4c2c-869c-f8c046cf5db7","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2014,"title":"Neural Machine Translation by Jointly Learning to Align and Translate","work_id":"d831e763-d530-4029-a65c-ac595d82cb2a","ref_index":4,"cited_arxiv_id":"1409.0473","is_internal_anchor":true},{"doi":"","year":1902,"title":"MultiGrain : a unified image embedding for classes and instances","work_id":"54472c12-9ee6-4a11-86b6-d57d3cfa0459","ref_index":5,"cited_arxiv_id":"1902.05509","is_internal_anchor":true}],"resolved_work":85,"snapshot_sha256":"a05ec25062deda5e9f77a8fe372fd781435d2a2bc718348de1151b8b3494e3d1","internal_anchors":17},"formal_canon":{"evidence_count":2,"snapshot_sha256":"13991ee97d01c10630ca894aec8db810cdc4dc60ddc1a4ca638ecc81a41bd46c"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2104.14294","created_at":"2026-05-17T23:38:47.666673+00:00"},{"alias_kind":"arxiv_version","alias_value":"2104.14294v2","created_at":"2026-05-17T23:38:47.666673+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2104.14294","created_at":"2026-05-17T23:38:47.666673+00:00"},{"alias_kind":"pith_short_12","alias_value":"5DCOF3VMGTZL","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"5DCOF3VMGTZLMLR7","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"5DCOF3VM","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":34,"internal_anchor_count":34,"sample":[{"citing_arxiv_id":"2602.05126","citing_title":"CLEAR-HPV: Interpretable concept discovery for human-papillomavirus-associated morphology in whole-slide histology","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2503.20314","citing_title":"Wan: Open and Advanced Large-Scale Video Generative Models","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2602.13294","citing_title":"VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22098","citing_title":"TextTeacher: What Can Language Teach About Images?","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2510.16416","citing_title":"SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2602.05126","citing_title":"CLEAR-HPV: Interpretable concept discovery for human-papillomavirus-associated morphology in whole-slide histology","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17630","citing_title":"SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20941","citing_title":"PaintCopilot: Modeling Painting as Autonomous Artistic Continuation","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17630","citing_title":"SegRAG: Training-Free Retrieval-Augmented Semantic Segmentation","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17671","citing_title":"PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17530","citing_title":"Few-Shot Network Intrusion Detection Using Online Triplet Mining","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16775","citing_title":"VolTA-3D: Self-Supervised Learning for Brain MRI using 3D Volumetric Token Alignment","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2411.04983","citing_title":"DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2512.04884","citing_title":"Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation","ref_index":3,"is_internal_anchor":true},{"citing_arxiv_id":"2110.04627","citing_title":"Vector-quantized Image Modeling with Improved VQGAN","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2208.03299","citing_title":"Atlas: Few-shot Learning with Retrieval Augmented Language Models","ref_index":126,"is_internal_anchor":true},{"citing_arxiv_id":"2602.06912","citing_title":"PANC: Prior-Aware Normalized Cut via Anchor-Augmented Token Graphs","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2512.15692","citing_title":"mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08078","citing_title":"Normalizing Trajectory Models","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04133","citing_title":"Learning Robust Visual Features in Computed Tomography Enables Efficient Transfer Learning for Clinical Tasks","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2106.08254","citing_title":"BEiT: BERT Pre-Training of Image Transformers","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12112","citing_title":"When Policy Entropy Constraint Fails: Preserving Diversity in Flow-based RLHF via Perceptual Entropy","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2112.09118","citing_title":"Unsupervised Dense Information Retrieval with Contrastive Learning","ref_index":118,"is_internal_anchor":true},{"citing_arxiv_id":"2404.08471","citing_title":"Revisiting Feature Prediction for Learning Visual Representations from Video","ref_index":227,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03669","citing_title":"FUS3DMaps: Scalable and Accurate Open-Vocabulary Semantic Mapping by 3D Fusion of Voxel- and Instance-Level Layers","ref_index":22,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/5DCOF3VMGTZLMLR7G3WPGXB35H","json":"https://pith.science/pith/5DCOF3VMGTZLMLR7G3WPGXB35H.json","graph_json":"https://pith.science/api/pith-number/5DCOF3VMGTZLMLR7G3WPGXB35H/graph.json","events_json":"https://pith.science/api/pith-number/5DCOF3VMGTZLMLR7G3WPGXB35H/events.json","paper":"https://pith.science/paper/5DCOF3VM"},"agent_actions":{"view_html":"https://pith.science/pith/5DCOF3VMGTZLMLR7G3WPGXB35H","download_json":"https://pith.science/pith/5DCOF3VMGTZLMLR7G3WPGXB35H.json","view_paper":"https://pith.science/paper/5DCOF3VM","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2104.14294&json=true","fetch_graph":"https://pith.science/api/pith-number/5DCOF3VMGTZLMLR7G3WPGXB35H/graph.json","fetch_events":"https://pith.science/api/pith-number/5DCOF3VMGTZLMLR7G3WPGXB35H/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/5DCOF3VMGTZLMLR7G3WPGXB35H/action/timestamp_anchor","attest_storage":"https://pith.science/pith/5DCOF3VMGTZLMLR7G3WPGXB35H/action/storage_attestation","attest_author":"https://pith.science/pith/5DCOF3VMGTZLMLR7G3WPGXB35H/action/author_attestation","sign_citation":"https://pith.science/pith/5DCOF3VMGTZLMLR7G3WPGXB35H/action/citation_signature","submit_replication":"https://pith.science/pith/5DCOF3VMGTZLMLR7G3WPGXB35H/action/replication_record"}},"created_at":"2026-05-17T23:38:47.666673+00:00","updated_at":"2026-05-17T23:38:47.666673+00:00"}