Pith Number

pith:M56RT635

pith:2023:M56RT635O3UJWLV6ECFPXRJDOT
not attested · not anchored · not stored · refs resolved

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

Chen Lin, Chris Liu, Han Qiu, Han Xiao, Hongsheng Li, Jiaming Han, Keqin Chen, Longtian Qiu, Peng Gao, Renrui Zhang, Siyuan Huang, Wenqi Shao, Xuming He, Yichi Zhang, Yu Qiao, Ziyi Lin

Mixing the weights of LLMs trained on real-world and synthetic data, together with varied tasks and visual embeddings, produces a single versatile multi-modal model.

arxiv:2311.07575 v1 · 2023-11-13 · cs.CV · cs.AI · cs.CL · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{M56RT635O3UJWLV6ECFPXRJDOT}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications.

C2 weakest assumption

Directly integrating the weights of LLMs trained on real-world and synthetic data is assumed to incorporate diverse semantics efficiently and robustly, without introducing conflicts or degrading performance.

C3 one-line summary

SPHINX improves multi-modal LLMs through joint mixing of weights, tasks, and visual embeddings from varied sources to achieve stronger alignment and multi-purpose capabilities.

References

45 extracted · 45 resolved · 22 Pith anchors

[1] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond · arXiv:2308.12966
[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language Models are Few-Shot Learners. 2020 · arXiv:2005.14165
[3] MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning 2023 · arXiv:2310.09478
[4] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning · arXiv:2305.06500
[5] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805

Formal links

2 machine-checked theorem links

Cited by

24 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:15.321821Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

677d19fb7d76e89b2ebe208afbc52374fbff19730c04592807ecbb5291149738

Aliases

arxiv: 2311.07575 · arxiv_version: 2311.07575v1 · doi: 10.48550/arxiv.2311.07575 · pith_short_12: M56RT635O3UJ · pith_short_16: M56RT635O3UJWLV6 · pith_short_8: M56RT635
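
The short aliases above appear to be prefixes of the full Pith Number, and the full number itself appears to be the first 26 characters of the Base32 encoding of the canonical hash. This relationship is inferred from the values on this page, not taken from a published specification, so the sketch below is an observation under that assumption. The function name `pith_number_from_hash` is hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "677d19fb7d76e89b2ebe208afbc52374fbff19730c04592807ecbb5291149738"
PITH_NUMBER = "M56RT635O3UJWLV6ECFPXRJDOT"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Assumed derivation: Base32-encode the raw digest and keep a prefix."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


derived = pith_number_from_hash(CANONICAL_HASH)

# The pith_short_8 / _12 / _16 aliases, taken as prefixes of the full number.
short_forms = {n: derived[:n] for n in (8, 12, 16)}

print(derived == PITH_NUMBER)
print(short_forms)
```

If the assumption holds for other records as well, any of the short forms could be checked against a locally computed canonical hash without contacting the service.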
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/M56RT635O3UJWLV6ECFPXRJDOT \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 677d19fb7d76e89b2ebe208afbc52374fbff19730c04592807ecbb5291149738
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "d423f6009012c6e415551ba5b524f51d92dd05608cf7355693107cba48281c06",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL",
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2023-11-13T18:59:47Z",
    "title_canon_sha256": "264902f5b7ca56be994ab61c7b18762656d7555d64a3e668d98375fb3664e00b"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2311.07575",
    "kind": "arxiv",
    "version": 1
  }
}
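
The shell pipeline in the Agent API section can also be reproduced offline. The following is a minimal sketch that applies the same canonicalization (sorted keys, compact separators, UTF-8, no ASCII escaping) to the record as displayed on this page and compares the digest with the published canonical hash. The helper name `canonical_sha256` is hypothetical, and a match depends on the displayed JSON being identical in content to the `canonical_record` the service returns.

```python
import hashlib
import json

# Canonical record as displayed on this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "d423f6009012c6e415551ba5b524f51d92dd05608cf7355693107cba48281c06",
    "cross_cats_sorted": ["cs.AI", "cs.CL", "cs.LG"],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2023-11-13T18:59:47Z",
    "title_canon_sha256": "264902f5b7ca56be994ab61c7b18762656d7555d64a3e668d98375fb3664e00b"
  },
  "schema_version": "1.0",
  "source": {"id": "2311.07575", "kind": "arxiv", "version": 1}
}
"""

# Canonical hash published on this page.
EXPECTED = "677d19fb7d76e89b2ebe208afbc52374fbff19730c04592807ecbb5291149738"


def canonical_bytes_of(record: dict) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Hex SHA-256 of the canonical serialization."""
    return hashlib.sha256(canonical_bytes_of(record)).hexdigest()


record = json.loads(RECORD_JSON)
canonical = canonical_bytes_of(record)
digest = canonical_sha256(record)

print(digest)
print("matches published hash:", digest == EXPECTED)
```

Sorting keys and fixing the separators makes the serialization independent of how the JSON was pretty-printed, which is why the indented record shown here and the compact form hash identically.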