Pith Number

pith:T3R2WYDI

pith:2024:T3R2WYDILTDHFB22LT4M6D44D3
not attested · not anchored · not stored · refs resolved

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

Bin Wang, Conghui He, Dahua Lin, Hang Yan, Haodong Duan, Jiaqi Wang, Jifeng Dai, Jingwen Li, Kai Chen, Lin Chen, Linke Ouyang, Pan Zhang, Peng Sun, Qipeng Guo, Rui Qian, Songyang Zhang, Wei Li, Wenhai Wang, Wenwei Zhang, Xiaoyi Dong, Xingcheng Zhang, Xinyue Zhang, Yang Gao, Yining Li, Yuhang Cao, Yuhang Zang, Yu Qiao

InternLM-XComposer-2.5 reaches GPT-4V level on vision-language tasks with a 7B model and 96K context support.

arxiv:2407.03320 v1 · 2024-07-03 · cs.CV · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{T3R2WYDILTDHFB22LT4M6D44D3}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend... outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks.

C2 · weakest assumption

The paper assumes that the 28 chosen benchmarks and the 16 key tasks are representative of real-world use, and that RoPE extrapolation from the 24K training context to 96K at inference introduces no hidden degradation on long outputs.

C3 · one-line summary

InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

References

183 extracted · 183 resolved · 36 Pith anchors

Formal links

2 machine-checked theorem links

Cited by

20 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:14.327329Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

9ee3ab60685cc672875a5cf8cf0f9c1ec15b3f02177cf550807d3b7ab251300e

Aliases

arxiv: 2407.03320 · arxiv_version: 2407.03320v1 · doi: 10.48550/arxiv.2407.03320 · pith_short_12: T3R2WYDILTDH · pith_short_16: T3R2WYDILTDHFB22 · pith_short_8: T3R2WYDI
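The identifiers on this page are consistent with one simple derivation: the 26-character Pith Number matches the first 26 characters of the RFC 4648 base32 encoding of the canonical hash, and each pith_short_N alias is its first N characters. This relationship is an observation about this one record, not a documented rule of the Pith schema. The helper name pith_number_from_hash below is hypothetical. A minimal sketch under that assumption:

```python
import base64

# Canonical hash as printed on this record page.
CANONICAL_HASH = "9ee3ab60685cc672875a5cf8cf0f9c1ec15b3f02177cf550807d3b7ab251300e"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Assumption: this mirrors how the identifier is derived. It is inferred
    from the values shown on this page, not taken from a specification.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


pith_number = pith_number_from_hash(CANONICAL_HASH)

# Short aliases are plain prefixes of the full identifier.
aliases = {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}

print(pith_number)  # T3R2WYDILTDHFB22LT4M6D44D3
for name, value in aliases.items():
    print(name, value)
```

If the assumption holds for other records, a mirror could recompute the short aliases from the canonical hash alone instead of storing them separately.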
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/T3R2WYDILTDHFB22LT4M6D44D3 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 9ee3ab60685cc672875a5cf8cf0f9c1ec15b3f02177cf550807d3b7ab251300e
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "21cceb9d462163087b0dca8e7bb289e0afc7fcd632313d0b62ce244763f889b9",
    "cross_cats_sorted": [
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-07-03T17:59:21Z",
    "title_canon_sha256": "38e695c3ae3d470f400cb2e8ab0933bd36b3e26713f77856af17cbb4736facd1"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2407.03320",
    "kind": "arxiv",
    "version": 1
  }
}
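
The same check can be run offline against the record printed above, without calling the API. The sketch below applies the serialization used in the shell one-liner (sorted keys, compact separators, UTF-8, non-ASCII characters left unescaped) and compares the digest with the canonical hash shown on this page. It assumes the JSON above is identical to the `canonical_record` field the API returns; the function name canonical_sha256 is illustrative.

```python
import hashlib
import json

# Canonical hash as printed on this record page.
EXPECTED = "9ee3ab60685cc672875a5cf8cf0f9c1ec15b3f02177cf550807d3b7ab251300e"

# The canonical record, copied from the JSON shown above.
record = {
    "metadata": {
        "abstract_canon_sha256": "21cceb9d462163087b0dca8e7bb289e0afc7fcd632313d0b62ce244763f889b9",
        "cross_cats_sorted": ["cs.CL"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2024-07-03T17:59:21Z",
        "title_canon_sha256": "38e695c3ae3d470f400cb2e8ab0933bd36b3e26713f77856af17cbb4736facd1",
    },
    "schema_version": "1.0",
    "source": {"id": "2407.03320", "kind": "arxiv", "version": 1},
}


def canonical_sha256(obj: dict) -> str:
    """Serialize with sorted keys and no whitespace, then hash the UTF-8 bytes."""
    canonical = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


digest = canonical_sha256(record)
print(digest)
print("matches page hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the order in which a mirror stores the fields. Any change to a value, such as the source version, produces a different digest.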