pith. the verified trust layer for science.
Pith Number

pith:YAOHT3YA

pith:2024:YAOHT3YAN4BJN2T6TGUDENXI6V
not attested · not anchored · not stored · refs resolved

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Bin Wang, Conghui He, Dahua Lin, Hang Yan, Haodong Duan, Jiaqi Wang, Jingwen Li, Kai Chen, Linke Ouyang, Maosong Cao, Pan Zhang, Songyang Zhang, Wei Li, Wenwei Zhang, Xiaoyi Dong, Xilin Wei, Xingcheng Zhang, Xinyue Zhang, Yang Gao, Yining Li, Yuhang Cao, Yuhang Zang, Yu Qiao

InternLM-XComposer2 generates custom interleaved text-image content by applying LoRA parameters only to image tokens.

arxiv:2401.16420 v1 · 2024-01-29 · cs.CV · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{YAOHT3YAN4BJN2T6TGUDENXI6V}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
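
To illustrate why a mirror can recompute the same state, here is a minimal Python sketch of one possible deterministic merge. The event fields (`kind`, `at`, `payload`), the last-writer-wins rule, and the function names are all assumptions made for this example; the actual Pith event schema and merge algorithm are not described on this page.

```python
import hashlib
import json


def event_id(event):
    """Content hash of an event, computed over key-sorted compact JSON."""
    canon = json.dumps(event, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()


def merge_state(record, events):
    """Fold events into a state that does not depend on arrival order.

    Hypothetical rule: drop duplicate events, order the rest by
    (timestamp, content hash), then let later events overwrite
    earlier ones of the same kind.
    """
    unique = {event_id(e): e for e in events}
    ordered = sorted(unique.items(), key=lambda item: (item[1]["at"], item[0]))
    state = {"record": record, "status": {}}
    for _, event in ordered:
        state["status"][event["kind"]] = event["payload"]
    return state


# Hypothetical events, not taken from a real bundle.
record = {"source": {"id": "2401.16420", "kind": "arxiv", "version": 1}}
events = [
    {"kind": "archive", "at": "2026-05-18T00:00:00Z", "payload": "pending"},
    {"kind": "archive", "at": "2026-05-19T00:00:00Z", "payload": "stored"},
    {"kind": "citations", "at": "2026-05-18T12:00:00Z", "payload": 21},
]

# Two mirrors receive the same events in different orders, one with a duplicate.
mirror_a = merge_state(record, events)
mirror_b = merge_state(record, list(reversed(events)) + [events[0]])
print(mirror_a == mirror_b)
```

Sorting on a total order (timestamp, then content hash as a tie-breaker) and removing duplicates by hash is what makes the result independent of how the events arrived.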

Claims

C1 strongest claim

InternLM-XComposer2 ... not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments.

C2 weakest assumption

That applying additional LoRA parameters exclusively to image tokens preserves the integrity of pre-trained language knowledge while enabling precise vision understanding and high-quality text composition.

C3 one line summary

InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.

References

105 extracted · 105 resolved · 16 Pith anchors

Formal links

2 machine-checked theorem links

Cited by

21 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:14.981310Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

c01c79ef006f0296ea7e99a83236e8f5749d05e79699e4e0c8abe238914c4934

Aliases

arxiv: 2401.16420 · arxiv_version: 2401.16420v1 · doi: 10.48550/arxiv.2401.16420 · pith_short_12: YAOHT3YAN4BJ · pith_short_16: YAOHT3YAN4BJN2T6 · pith_short_8: YAOHT3YA
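
The identifiers on this page appear to be related: the 26-character Pith Number matches the start of the Base32 encoding of the canonical hash, and each `pith_short_*` alias is a prefix of the Pith Number. The Python sketch below checks both relations for this record. The function names are illustrative, and the derivation rule is inferred from the values shown here, not taken from a published specification.

```python
import base64

CANONICAL_HASH = "c01c79ef006f0296ea7e99a83236e8f5749d05e79699e4e0c8abe238914c4934"
PITH_NUMBER = "YAOHT3YAN4BJN2T6TGUDENXI6V"


def pith_from_hash(hex_digest, length=26):
    """Base32-encode the raw digest bytes and keep the first `length` characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number):
    """Prefix aliases, keyed the same way as in the alias list on this page."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


print(pith_from_hash(CANONICAL_HASH) == PITH_NUMBER)
print(short_aliases(PITH_NUMBER))
```

If the relation holds in general, anyone holding the canonical hash could re-derive the Pith Number and its short forms without contacting the server, but that generalization is an assumption.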
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/YAOHT3YAN4BJN2T6TGUDENXI6V \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: c01c79ef006f0296ea7e99a83236e8f5749d05e79699e4e0c8abe238914c4934
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "8429ed639989a4121da3104fc4fd2393bc12545d4777baa9397279c0f2651057",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-01-29T18:59:02Z",
    "title_canon_sha256": "3b2deff91597c496b7dbbec7f1d2f0eaef3a13ef574dfa65915194c7ee757aa0"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2401.16420",
    "kind": "arxiv",
    "version": 1
  }
}
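
The same check can be run offline against the record above. This is a minimal Python sketch that applies the canonicalization used in the command-line check (sorted keys, compact separators, non-ASCII characters kept, UTF-8 bytes) and prints the digest next to the published hash. The function names are illustrative, and the sketch assumes the JSON shown above is exactly the `canonical_record` field returned by the API.

```python
import hashlib
import json

EXPECTED = "c01c79ef006f0296ea7e99a83236e8f5749d05e79699e4e0c8abe238914c4934"

# Copied from the canonical record JSON on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "8429ed639989a4121da3104fc4fd2393bc12545d4777baa9397279c0f2651057",
        "cross_cats_sorted": ["cs.CL"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2024-01-29T18:59:02Z",
        "title_canon_sha256": "3b2deff91597c496b7dbbec7f1d2f0eaef3a13ef574dfa65915194c7ee757aa0",
    },
    "schema_version": "1.0",
    "source": {"id": "2401.16420", "kind": "arxiv", "version": 1},
}


def canonical_bytes(obj):
    """Serialize with sorted keys and no optional whitespace, then encode as UTF-8."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_sha256(obj):
    """Hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(obj)).hexdigest()


digest = canonical_sha256(record)
print("computed:", digest)
print("expected:", EXPECTED)
print("match:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which a client happens to store or print the fields.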