pith:TTNY2ODJ
CoCa: Contrastive Captioners are Image-Text Foundation Models
CoCa jointly trains contrastive and captioning losses in one encoder-decoder to create image-text foundation models that reach new state-of-the-art on ImageNet and multimodal tasks.
arxiv:2205.01917 v2 · 2022-05-04 · cs.CV · cs.LG · cs.MM
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{TTNY2ODJEGWATHHS3VDQJ3QXF2}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
CoCa obtains 86.3% zero-shot top-1 accuracy on ImageNet, 90.6% with a frozen encoder and learned classification head, and new state-of-the-art 91.0% top-1 accuracy on ImageNet with a finetuned encoder, while also leading on Kinetics, MSCOCO, VQA, and other tasks.
The approach rests on the premise that omitting cross-attention in the first half of the decoder layers cleanly separates unimodal text representations from multimodal ones without harming overall capacity or optimization stability.
CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
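The decoder split named in the claims above can be pictured as a layer plan. The following is a minimal sketch, not the authors' implementation: `DecoderLayerSpec` and `plan_decoder_layers` are hypothetical names, and the even layer count is an assumption made so that "first half" is well defined.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DecoderLayerSpec:
    """Hypothetical description of one text-decoder layer."""
    index: int
    cross_attention: bool  # True if the layer attends to image features


def plan_decoder_layers(num_layers: int) -> List[DecoderLayerSpec]:
    """Return a layer plan in which only the second half cross-attends.

    Assumption: num_layers is positive and even, so the decoder splits
    into two equal halves.
    """
    if num_layers <= 0 or num_layers % 2 != 0:
        raise ValueError("num_layers must be a positive even number")
    half = num_layers // 2
    # Layers [0, half) see text only (unimodal); layers [half, num_layers)
    # also attend to the image encoder output (multimodal).
    return [
        DecoderLayerSpec(index=i, cross_attention=(i >= half))
        for i in range(num_layers)
    ]


if __name__ == "__main__":
    for layer in plan_decoder_layers(4):
        kind = "multimodal" if layer.cross_attention else "unimodal"
        print(layer.index, kind)
```

In this picture the output of the last unimodal layer is a text-only representation, and the output of the last multimodal layer conditions on the image as well.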
Receipt and verification
| Field | Value |
|---|---|
| First computed | 2026-05-17T23:38:52.732937Z |
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
9cdb8d386921ac099cf2dd4704ee172ea655464ac80816bd7db093177d342562
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/TTNY2ODJEGWATHHS3VDQJ3QXF2 \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 9cdb8d386921ac099cf2dd4704ee172ea655464ac80816bd7db093177d342562
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "33c64576bd99d3220fac09e2275b95093c5fe2928a27639a33ad2c3ad2b5f166",
    "cross_cats_sorted": [
      "cs.LG",
      "cs.MM"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2022-05-04T07:01:14Z",
    "title_canon_sha256": "979368834a9c1c63162fa1ef78acefae237399dabc836024cbfc08aa7f9b530b"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2205.01917",
    "kind": "arxiv",
    "version": 2
  }
}
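The hashing step of the shell pipeline above can also be run offline. The sketch below applies the same `json.dumps` settings (`sort_keys=True`, compact separators, `ensure_ascii=False`) followed by SHA-256. It assumes the record copied by hand into `RECORD` is exactly what the API returns under `canonical_record`; the helper names `canonical_bytes` and `canonical_sha256` are illustrative and not part of any Pith tooling.

```python
import hashlib
import json

# Record copied by hand from the "Canonical record JSON" section of this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "33c64576bd99d3220fac09e2275b95093c5fe2928a27639a33ad2c3ad2b5f166",
        "cross_cats_sorted": ["cs.LG", "cs.MM"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2022-05-04T07:01:14Z",
        "title_canon_sha256": "979368834a9c1c63162fa1ef78acefae237399dabc836024cbfc08aa7f9b530b",
    },
    "schema_version": "1.0",
    "source": {"id": "2205.01917", "kind": "arxiv", "version": 2},
}

# Canonical hash as published on this page.
EXPECTED = "9cdb8d386921ac099cf2dd4704ee172ea655464ac80816bd7db093177d342562"


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and no extra whitespace, as UTF-8."""
    text = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Return the hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


digest = canonical_sha256(RECORD)
print("computed:", digest)
print("expected:", EXPECTED)
print("match:   ", digest == EXPECTED)
```

Because keys are sorted before hashing, the order in which the record's fields are written down does not affect the digest; a change to any value, or to whitespace inside a string, does.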