Pith Number

pith:L5DDZ2QZ

pith:2025:L5DDZ2QZIV2FS2B7UNNWJLPIKJ
not attested · not anchored · not stored · refs resolved

Perception Encoder: The best visual embeddings are not at the output of the network

Andrea Madotto, Chen Wei, Christoph Feichtenhofer, Daniel Bolya, Daniel Li, Hanoona Rasheed, Hu Xu, Jang Hyun Cho, Jathushan Rajasegaran, Jiale Zhi, Junke Wang, Marco Monteiro, Nikhila Ravi, Peize Sun, Piotr Dollár, Po-Yao Huang, Shiyu Dong, Tengyu Ma

The best visual embeddings for images and videos come from intermediate layers of a contrastively trained network rather than its final output.

arxiv:2504.13181 v2 · 2025-04-17 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{L5DDZ2QZIV2FS2B7UNNWJLPIKJ}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

after scaling our carefully tuned image pretraining recipe and refining with our robust video data engine, we find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. There is only one caveat: these embeddings are hidden within the intermediate layers of the network.

C2 weakest assumption

The intermediate-layer embeddings are assumed to remain superior after the two alignment procedures, with no post-hoc data selection or task-specific hyperparameter tuning that would undermine the claim of a single general pretraining recipe.

C3 one line summary

Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.

References

169 extracted · 169 resolved · 20 Pith anchors


Formal links

2 machine-checked theorem links

Cited by

41 papers in Pith

Receipt and verification
First computed: 2026-05-18T04:23:23.597930Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

5f463cea19457459683fa35b64ade85279a5e94f291864f9f7ba95e465291165

Aliases

arxiv: 2504.13181 · arxiv_version: 2504.13181v2 · doi: 10.48550/arxiv.2504.13181 · pith_short_12: L5DDZ2QZIV2F · pith_short_16: L5DDZ2QZIV2FS2B7 · pith_short_8: L5DDZ2QZ
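The identifiers shown on this page appear to be related to one another. The 26-character Pith Number matches the first 26 characters of the standard base32 (RFC 4648) encoding of the canonical hash bytes, and each pith_short_N alias is the first N characters of the Pith Number. This is an observation from the values listed here, not a rule the page states, so the sketch below is only a consistency check under that assumption. The function names `pith_from_hash` and `short_aliases` are hypothetical.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "5f463cea19457459683fa35b64ade85279a5e94f291864f9f7ba95e465291165"
PITH_NUMBER = "L5DDZ2QZIV2FS2B7UNNWJLPIKJ"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the first `length` characters.

    Assumption: this truncated-base32 rule is inferred from the values on
    this page; it is not documented here as the official derivation.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str, sizes=(8, 12, 16)) -> dict:
    """Return the pith_short_N aliases as prefixes of the Pith Number."""
    return {f"pith_short_{n}": pith_number[:n] for n in sizes}


if __name__ == "__main__":
    print(pith_from_hash(CANONICAL_HASH))
    print(short_aliases(PITH_NUMBER))
```

If the assumption holds for other records, a mirror could cross-check a Pith Number against its canonical hash without contacting the service.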
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/L5DDZ2QZIV2FS2B7UNNWJLPIKJ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 5f463cea19457459683fa35b64ade85279a5e94f291864f9f7ba95e465291165
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "03ab2b855c72230580f5d0a2e514a039a1a78cf6641bf2d4636445f57968f8e2",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-04-17T17:59:57Z",
    "title_canon_sha256": "451081b8c383b7d3d716be07c800834b1ca1e73ae180cef305cbd11d15d32e78"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2504.13181",
    "kind": "arxiv",
    "version": 2
  }
}
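The verification one-liner above can be unpacked into a small offline script. This is a minimal sketch that assumes the canonical form is exactly what the one-liner uses: keys sorted, compact separators, `ensure_ascii=False`, UTF-8 bytes, then SHA-256. The function name `canonical_sha256` is hypothetical, and the script only reports whether its digest equals the hash stated on this page.

```python
import hashlib
import json

# Canonical record JSON as shown on this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "03ab2b855c72230580f5d0a2e514a039a1a78cf6641bf2d4636445f57968f8e2",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-04-17T17:59:57Z",
    "title_canon_sha256": "451081b8c383b7d3d716be07c800834b1ca1e73ae180cef305cbd11d15d32e78"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2504.13181",
    "kind": "arxiv",
    "version": 2
  }
}
"""

# Hash stated on this page; the script compares against it but does not assume a match.
EXPECTED = "5f463cea19457459683fa35b64ade85279a5e94f291864f9f7ba95e465291165"


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    digest = canonical_sha256(json.loads(RECORD_JSON))
    print(digest)
    print("matches hash on page:", digest == EXPECTED)
```

Because the keys are sorted before hashing, the digest does not depend on the order in which a mirror stores or transmits the record's fields.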