{"state_type":"pith_open_graph_state","state_version":"1.0","pith_number":"pith:2025:L5DDZ2QZIV2FS2B7UNNWJLPIKJ","merge_version":"pith-open-graph-merge-v1","event_count":2,"valid_event_count":2,"invalid_event_count":0,"equivocation_count":0,"current":{"canonical_record":{"metadata":{"abstract_canon_sha256":"03ab2b855c72230580f5d0a2e514a039a1a78cf6641bf2d4636445f57968f8e2","cross_cats_sorted":[],"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2025-04-17T17:59:57Z","title_canon_sha256":"451081b8c383b7d3d716be07c800834b1ca1e73ae180cef305cbd11d15d32e78"},"schema_version":"1.0","source":{"id":"2504.13181","kind":"arxiv","version":2}},"source_aliases":[{"alias_kind":"arxiv","alias_value":"2504.13181","created_at":"2026-05-18T04:23:23Z"},{"alias_kind":"arxiv_version","alias_value":"2504.13181v2","created_at":"2026-05-18T04:23:23Z"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2504.13181","created_at":"2026-05-18T04:23:23Z"},{"alias_kind":"pith_short_12","alias_value":"L5DDZ2QZIV2F","created_at":"2026-05-18T12:33:37Z"},{"alias_kind":"pith_short_16","alias_value":"L5DDZ2QZIV2FS2B7","created_at":"2026-05-18T12:33:37Z"},{"alias_kind":"pith_short_8","alias_value":"L5DDZ2QZ","created_at":"2026-05-18T12:33:37Z"}],"graph_snapshots":[{"event_id":"sha256:82208317ea37e3800eb751bb148a2ca5f4ed21fd277091eb743dd7196bce27f9","target":"graph","created_at":"2026-05-18T04:23:23Z","signer":{"key_id":"pith-v1-2026-05","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","signer_id":"pith.science","signer_type":"pith_registry"},"payload":{"graph_snapshot":{"author_claims":{"count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","strong_count":0},"builder_version":"pith-number-builder-2026-05-17-v1","claims":{"count":4,"items":[{"attestation":"unclaimed","claim_id":"C1","kind":"strongest_claim","source":"verdict.strongest_claim","status":"machine_extracted","text":"after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network."},{"attestation":"unclaimed","claim_id":"C2","kind":"weakest_assumption","source":"verdict.weakest_assumption","status":"machine_extracted","text":"That the intermediate-layer embeddings remain superior after the two alignment procedures without post-hoc data selection or task-specific hyperparameter tuning that would undermine the claim of a single general pretraining recipe."},{"attestation":"unclaimed","claim_id":"C3","kind":"one_line_summary","source":"verdict.one_line_summary","status":"machine_extracted","text":"Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment."},{"attestation":"unclaimed","claim_id":"C4","kind":"headline","source":"verdict.pith_extraction.headline","status":"machine_extracted","text":"The best visual embeddings for images and videos come from intermediate layers of a contrastively trained network rather than its final output."}],"snapshot_sha256":"093cd69663750ce0aee377802b1bfc3b5bbbead4847c43bfc5ffb4d104da318d"},"formal_canon":{"evidence_count":2,"snapshot_sha256":"a71aca8f04904d580988c3689748b903e924d78d88b0b76a7db9bc0196e29351"},"paper":{"abstract_excerpt":"We introduce Perception Encoder (PE), a state-of-the-art vision encoder for image and video understanding trained via simple vision-language learning. Traditionally, vision encoders have relied on a variety of pretraining objectives, each tailored to specific downstream tasks such as classification, captioning, or localization. Surprisingly, after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one c","authors_text":"Andrea Madotto, Chen Wei, Christoph Feichtenhofer, Daniel Bolya, Daniel Li, Hanoona Rasheed, Hu Xu, Jang Hyun Cho, Jathushan Rajasegaran, Jiale Zhi, Junke Wang, Marco Monteiro, Nikhila Ravi, Peize Sun, Piotr Doll\\'ar, Po-Yao Huang, Shiyu Dong, Tengyu Ma","cross_cats":[],"headline":"The best visual embeddings for images and videos come from intermediate layers of a contrastively trained network rather than its final output.","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2025-04-17T17:59:57Z","title":"Perception Encoder: The best visual embeddings are not at the output of the network"},"references":{"count":169,"internal_anchors":20,"resolved_work":169,"sample":[{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":1,"title":"Nocaps: Novel object captioning at scale","work_id":"041edd2d-2995-46f1-a2f6-1b15274c0edf","year":2019},{"cited_arxiv_id":"2410.07073","doi":"","is_internal_anchor":true,"ref_index":2,"title":"Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Baptiste Bout, Devendra Chaplot, Jessica Chudnovsky, Diogo Costa, Baudouin De Monicault, Saurabh Garg, Theophile Gervet, Soham Ghosh, Amélie Héliou, P","work_id":"9ad2b071-82d8-4cfa-b994-b9975094b575","year":2024},{"cited_arxiv_id":"2308.12966","doi":"","is_internal_anchor":true,"ref_index":3,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","year":2023},{"cited_arxiv_id":"","doi":"","is_internal_anchor":false,"ref_index":4,"title":"ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models","work_id":"073ccda3-4075-4094-b532-d808f9ecd0b4","year":2019},{"cited_arxiv_id":"2407.07726","doi":"","is_internal_anchor":true,"ref_index":5,"title":"PaliGemma: A versatile 3B VLM for transfer","work_id":"df6f48b3-5792-47c7-9614-cb856ea31ad9","year":2024}],"snapshot_sha256":"281e242da9f17678256c5f9a0aff02d3e8b2bd788438d8cb4c55d170b9f115db"},"source":{"id":"2504.13181","kind":"arxiv","version":2},"verdict":{"created_at":"2026-05-13T22:16:59.879451Z","id":"9a4b5b32-b4ac-433f-91cd-ace723ab191d","model_set":{"reader":"grok-4.3"},"one_line_summary":"Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.","pipeline_version":"pith-pipeline@v0.9.0","pith_extraction_headline":"The best visual embeddings for images and videos come from intermediate layers of a contrastively trained network rather than its final output.","strongest_claim":"after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network.","weakest_assumption":"That the intermediate-layer embeddings remain superior after the two alignment procedures without post-hoc data selection or task-specific hyperparameter tuning that would undermine the claim of a single general pretraining recipe."}},"verdict_id":"9a4b5b32-b4ac-433f-91cd-ace723ab191d"}}],"author_attestations":[],"timestamp_anchors":[],"storage_attestations":[],"citation_signatures":[],"replication_records":[],"corrections":[],"mirror_hints":[],"record_created":{"event_id":"sha256:2830a3583a5f82bea1fbfbd3eaaec89a37f7ee522c122c200dbc9be83b2a81fe","target":"record","created_at":"2026-05-18T04:23:23Z","signer":{"key_id":"pith-v1-2026-05","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","signer_id":"pith.science","signer_type":"pith_registry"},"payload":{"attestation_state":"computed","canonical_record":{"metadata":{"abstract_canon_sha256":"03ab2b855c72230580f5d0a2e514a039a1a78cf6641bf2d4636445f57968f8e2","cross_cats_sorted":[],"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2025-04-17T17:59:57Z","title_canon_sha256":"451081b8c383b7d3d716be07c800834b1ca1e73ae180cef305cbd11d15d32e78"},"schema_version":"1.0","source":{"id":"2504.13181","kind":"arxiv","version":2}},"canonical_sha256":"5f463cea19457459683fa35b64ade85279a5e94f291864f9f7ba95e465291165","receipt":{"algorithm":"ed25519","builder_version":"pith-number-builder-2026-05-17-v1","canonical_sha256":"5f463cea19457459683fa35b64ade85279a5e94f291864f9f7ba95e465291165","first_computed_at":"2026-05-18T04:23:23.597930Z","key_id":"pith-v1-2026-05","kind":"pith_receipt","last_reissued_at":"2026-05-18T04:23:23.597930Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54","receipt_version":"0.3","signature_b64":"n14HSx3OmKUc6X+sWQYyPTi8LbYIfO4cdvxdaSoDk6lsyXrDmWIUg1oNse1Y4scPiW69m6lywty+L8Raq1qWBQ==","signature_status":"signed_v1","signed_at":"2026-05-18T04:23:23.598441Z","signed_message":"canonical_sha256_bytes"},"source_id":"2504.13181","source_kind":"arxiv","source_version":2}}},"equivocations":[],"invalid_events":[],"applied_event_ids":["sha256:2830a3583a5f82bea1fbfbd3eaaec89a37f7ee522c122c200dbc9be83b2a81fe","sha256:82208317ea37e3800eb751bb148a2ca5f4ed21fd277091eb743dd7196bce27f9"],"state_sha256":"90e36ff00f5411ce35414b2f177cf27dbb7876d1d40441d68889ca604b7af90d"}