Pith Number

pith:CTVKNTPZ

pith:2023:CTVKNTPZG7G2XIRXBJVDSVT54A
not attested · not anchored · not stored · refs resolved

CogVLM: Visual Expert for Pretrained Language Models

Bin Xu, Jiazheng Xu, Jie Tang, Ji Qi, Juanzi Li, Junhui Ji, Lei Zhao, Ming Ding, Qingsong Lv, Weihan Wang, Wenmeng Yu, Wenyi Hong, Xixuan Song, Yan Wang, Yuxiao Dong, Zhuoyi Yang

A trainable visual expert module inserted into the attention and FFN layers of a frozen language model enables deep vision-language fusion.

arxiv:2311.03079 v2 · 2023-11-06 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{CTVKNTPZG7G2XIRXBJVDSVT54A}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
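
The page does not spell out the merge algorithm, so the following is only a minimal sketch of how an order-independent merge could work, not the Pith implementation. The event fields (`at`, `field`, `value`), the content-addressed event ID, and the "later event wins" rule are all assumptions made for illustration, and signature checking is left out entirely.

```python
import hashlib
import json
from typing import Iterable


def event_id(event: dict) -> str:
    """Content-address an event so exact duplicates collapse (assumed scheme)."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def merge_state(record: dict, events: Iterable[dict]) -> dict:
    """Fold events into a state that does not depend on the order they arrive in."""
    unique = {event_id(e): e for e in events}  # drop exact duplicates
    # Sort by timestamp, then by ID, so ties are broken the same way everywhere.
    ordered = sorted(unique.items(), key=lambda kv: (kv[1]["at"], kv[0]))
    state = {"record": record, "fields": {}}
    for _, event in ordered:
        state["fields"][event["field"]] = event["value"]  # later event wins
    return state


# Hypothetical events, invented purely for this example.
record = {"source": {"id": "2311.03079", "kind": "arxiv", "version": 2}}
events = [
    {"at": "2026-05-18T00:00:00Z", "field": "citations", "value": "open"},
    {"at": "2026-05-19T00:00:00Z", "field": "citations", "value": "closed"},
    {"at": "2026-05-18T00:00:00Z", "field": "replications", "value": "open"},
]

forward = merge_state(record, events)
shuffled = merge_state(record, list(reversed(events)) + events[:1])
print(forward == shuffled)  # True: same state regardless of order or duplicates
```

Sorting on a total order before folding is what lets two mirrors that received the same events in different orders arrive at the same state.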

Claims

C1 strongest claim

CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks... surpassing or matching PaLI-X 55B.

C2 weakest assumption

The visual expert module can be inserted into the attention and FFN layers of any frozen pretrained language model without requiring changes to the original architecture or loss functions.

C3 one line summary

CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.
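
To make the claims above more concrete, here is a toy sketch of one way a "visual expert" could sit next to a frozen projection: image tokens go through a separate trainable weight matrix, text tokens through the frozen one. This page only states that the module is inserted into the attention and FFN layers, so the per-token routing, the function names, and the 2x2 weights below are illustrative assumptions rather than the paper's code.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def routed_projection(tokens: List[Vector], is_image: List[bool],
                      w_text: Matrix, w_image: Matrix) -> List[Vector]:
    """Apply w_image to image tokens and w_text to text tokens."""
    return [matvec(w_image if img else w_text, tok)
            for tok, img in zip(tokens, is_image)]


# Toy weights: W_TEXT stands in for a frozen language-model projection,
# W_IMAGE for the trainable visual-expert projection (both hypothetical).
W_TEXT = [[1.0, 0.0], [0.0, 1.0]]
W_IMAGE = [[2.0, 0.0], [0.0, 0.5]]

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
is_image = [True, True, False]  # two image tokens followed by one text token

out = routed_projection(tokens, is_image, W_TEXT, W_IMAGE)
print(out)  # [[2.0, 1.0], [6.0, 2.0], [5.0, 6.0]]
```

Because text tokens only ever see the frozen weights, their projections are unchanged, which is the intuition behind preserving NLP performance in the summary above.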

References

33 extracted · 33 resolved · 17 Pith anchors

[1] OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models · arXiv:2308.01390
[2] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond · arXiv:2308.12966
[3] Murel: Multimodal relational reasoning for visual question answering 1989
[4] Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic · arXiv:2306.15195
[5] Universal captioner: Long-tail vision-and-language model training through content-style separation · arXiv:2111.12727

Formal links

2 machine-checked theorem links

Cited by

45 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:51.021764Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

14eaa6cdf937cdaba2370a6a39567de015ee54eca0c505143d4d420dfa34f0e5
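
The page does not say how the 26-character Pith Number relates to this hash. As an observation only, the number shown here matches the unpadded Base32 encoding of the hash's first 16 bytes. The sketch below checks that for this record; the helper name `base32_prefix` and the 16-byte cut are assumptions, not a documented rule.

```python
import base64

CANONICAL_HASH = "14eaa6cdf937cdaba2370a6a39567de015ee54eca0c505143d4d420dfa34f0e5"
PITH_NUMBER = "CTVKNTPZG7G2XIRXBJVDSVT54A"


def base32_prefix(hex_digest: str, n_bytes: int = 16) -> str:
    """Base32-encode the first n_bytes of a hex digest and strip '=' padding."""
    raw = bytes.fromhex(hex_digest)[:n_bytes]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


derived = base32_prefix(CANONICAL_HASH)
print(derived)                 # CTVKNTPZG7G2XIRXBJVDSVT54A
print(derived == PITH_NUMBER)  # True
```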

Aliases

arxiv: 2311.03079 · arxiv_version: 2311.03079v2 · doi: 10.48550/arxiv.2311.03079 · pith_short_12: CTVKNTPZG7G2 · pith_short_16: CTVKNTPZG7G2XIRX · pith_short_8: CTVKNTPZ
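
The three `pith_short_*` aliases listed above look like plain prefixes of the full Pith Number. A small check, assuming the numeric suffix in each alias name is the prefix length (this is inferred from the values on this page, not a documented rule):

```python
PITH_NUMBER = "CTVKNTPZG7G2XIRXBJVDSVT54A"

# Alias values copied from the Aliases line above.
ALIASES = {
    "pith_short_8": "CTVKNTPZ",
    "pith_short_12": "CTVKNTPZG7G2",
    "pith_short_16": "CTVKNTPZG7G2XIRX",
}


def short_alias(number: str, length: int) -> str:
    """Return the first `length` characters of a Pith Number."""
    return number[:length]


checks = {}
for name, value in ALIASES.items():
    length = int(name.rsplit("_", 1)[1])  # "pith_short_12" -> 12
    checks[name] = short_alias(PITH_NUMBER, length) == value

print(checks)  # every alias maps to True
```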
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/CTVKNTPZG7G2XIRXBJVDSVT54A \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 14eaa6cdf937cdaba2370a6a39567de015ee54eca0c505143d4d420dfa34f0e5
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "9ed531cb4a2ee62bd4512e8535ec68ef02bf4f67385e61fe8e221a00b5f126b6",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2023-11-06T13:04:39Z",
    "title_canon_sha256": "679fe85268225460d07d2179c1c3c8b521429885cfbc6b874c9f34e37b4130b4"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2311.03079",
    "kind": "arxiv",
    "version": 2
  }
}
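
The shell one-liner in the verification step can also be written as a short Python function. This sketch uses the same serialization settings (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) and takes the record exactly as displayed above. Whether the displayed JSON is the complete canonical record is an assumption, so the digest is printed for comparison with the canonical hash rather than asserted.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    encoded = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


metadata = {
    "abstract_canon_sha256": "9ed531cb4a2ee62bd4512e8535ec68ef02bf4f67385e61fe8e221a00b5f126b6",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2023-11-06T13:04:39Z",
    "title_canon_sha256": "679fe85268225460d07d2179c1c3c8b521429885cfbc6b874c9f34e37b4130b4",
}
source = {"id": "2311.03079", "kind": "arxiv", "version": 2}

record = {"metadata": metadata, "schema_version": "1.0", "source": source}
# Same content with keys in a different order, to show ordering does not matter.
reordered = {"source": source, "schema_version": "1.0", "metadata": metadata}

digest = canonical_sha256(record)
print(digest)  # compare by eye with the canonical hash listed on this page
print(digest == canonical_sha256(reordered))  # True
```

Sorting keys and fixing the separators removes the two main sources of variation in JSON output, which is why any client can reproduce the same digest from the same record.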