Pith Number

pith:ROYX62NC

pith:2022:ROYX62NCB72JQHDVNJK6MR3MCN
not attested · not anchored · not stored · refs resolved

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Adam Grycner, AJ Piergiovanni, Alexander Kolesnikov, Andreas Steiner, Anelia Angelova, Ashish Thapliyal, Basil Mustafa, Burcu Karagol Ayan, Carlos Riquelme, Chao Jia, Daniel Salz, Gaurav Mishra, Hassan Akbari, James Bradbury, Joan Puigcerver, Keran Rong, Linting Xue, Lucas Beyer, Mojtaba Seyedhosseini, Nan Ding, Neil Houlsby, Piotr Padlewski, Radu Soricut, Sebastian Goodman, Soravit Changpinyo, Weicheng Kuo, Xiaohua Zhai, Xiao Wang, Xi Chen

PaLI jointly scales a 4-billion-parameter vision transformer with a language model on a 10B multilingual image-text set to reach state-of-the-art on captioning, VQA and scene-text tasks.

arxiv:2209.06794 v4 · 2022-09-14 · cs.CV · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{ROYX62NCB72JQHDVNJK6MR3MCN}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle: live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

PaLI achieves state-of-the-art in multiple vision and language tasks (such as captioning, visual question-answering, scene-text understanding), while retaining a simple, modular, and scalable design.

C2 · weakest assumption

That joint scaling of the vision and language components on the new 10B multilingual dataset will produce the claimed performance gains without major issues from data quality, language imbalance, or overfitting.

C3 · one-line summary

PaLI jointly scales a 4B-parameter vision transformer with language models on a new 10B multilingual image-text dataset to reach state-of-the-art results on vision-language tasks while keeping a simple modular design.

References

185 extracted · 185 resolved · 12 Pith anchors


Formal links

1 machine-checked theorem link

Cited by

34 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:48.354221Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

8bb17f69a20ff4981c756a55e6476c13441e2bbbd55d7f5c78db51a1e6549e0a

Aliases

arxiv: 2209.06794 · arxiv_version: 2209.06794v4 · doi: 10.48550/arxiv.2209.06794 · pith_short_12: ROYX62NCB72J · pith_short_16: ROYX62NCB72JQHDV · pith_short_8: ROYX62NC
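
The short aliases above are prefixes of the full Pith Number, and the full number itself matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash shown on this page. The page does not document this derivation, so the sketch below is an observation from the listed values rather than the registry's specification. The function names `pith_number_from_hash` and `short_aliases` are hypothetical.

```python
import base64

# Canonical hash as printed on this page.
CANONICAL_HASH = "8bb17f69a20ff4981c756a55e6476c13441e2bbbd55d7f5c78db51a1e6549e0a"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Assumption: this mirrors how the identifier on this page relates to its
    hash; it is inferred from the values shown, not from a published spec.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str) -> dict:
    """Build the 8-, 12- and 16-character prefix aliases listed under Aliases."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


if __name__ == "__main__":
    number = pith_number_from_hash(CANONICAL_HASH)
    print(number)  # ROYX62NCB72JQHDVNJK6MR3MCN
    print(short_aliases(number))
```

For this record the result agrees with the identifier and all three short aliases listed above.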
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/ROYX62NCB72JQHDVNJK6MR3MCN \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 8bb17f69a20ff4981c756a55e6476c13441e2bbbd55d7f5c78db51a1e6549e0a
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "2a50651767b6289fbf279711ac7379d502692af9f7b0932b728ccd5beb6987f9",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2022-09-14T17:24:07Z",
    "title_canon_sha256": "08a218ce080d71719c57e13038c8be63fb06bfe224ac17e3c8cb281586c15081"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2209.06794",
    "kind": "arxiv",
    "version": 4
  }
}
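
The one-line verification command above can also be run offline against a saved copy of the canonical record. The following is a minimal sketch that uses the same serialization settings as that command (sorted keys, compact separators, `ensure_ascii=False`). The helper name `canonical_hash` is hypothetical, and the sketch assumes the record shown on this page is byte-for-byte what the API returns.

```python
import hashlib
import json

# Canonical record copied from this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "2a50651767b6289fbf279711ac7379d502692af9f7b0932b728ccd5beb6987f9",
        "cross_cats_sorted": ["cs.CL"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2022-09-14T17:24:07Z",
        "title_canon_sha256": "08a218ce080d71719c57e13038c8be63fb06bfe224ac17e3c8cb281586c15081",
    },
    "schema_version": "1.0",
    "source": {"id": "2209.06794", "kind": "arxiv", "version": 4},
}

# Hash the page says to expect.
EXPECTED = "8bb17f69a20ff4981c756a55e6476c13441e2bbbd55d7f5c78db51a1e6549e0a"


def canonical_hash(record: dict) -> str:
    """Serialize with sorted keys and no extra whitespace, then SHA-256 the UTF-8 bytes."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    digest = canonical_hash(RECORD)
    print(digest)
    print("match" if digest == EXPECTED else "mismatch")
```

Because keys are sorted before hashing, the digest does not depend on the order in which the record's fields were written or parsed.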