Pith Number

pith:M56RT635

pith:2023:M56RT635O3UJWLV6ECFPXRJDOT

not attested · not anchored · not stored · refs resolved

SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models

Chen Lin, Chris Liu, Han Qiu, Han Xiao, Hongsheng Li, Jiaming Han, Keqin Chen, Longtian Qiu, Peng Gao, Renrui Zhang, Siyuan Huang, Wenqi Shao, Xuming He, Yichi Zhang, Yu Qiao, Ziyi Lin

Mixing weights from LLMs trained on real-world and synthetic data, together with varied tasks and visual embeddings, produces a single versatile multi-modal model.

arxiv:2311.07575 v1 · 2023-11-13 · cs.CV · cs.AI · cs.CL · cs.LG


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{M56RT635O3UJWLV6ECFPXRJDOT}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live · download bundle · merged state

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
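The merge algorithm itself is not described on this page. The following is only a minimal sketch of why a deterministic merge lets every mirror reach the same state: events are deduplicated and applied in an order that depends on their content, not on the order in which they arrived. The event fields (`at`, `field`, `value`) and the last-write-wins rule are assumptions made for illustration, and signature checking is left out entirely.

```python
import hashlib
import json


def event_key(event):
    """Hypothetical ordering key: timestamp first, content hash as tie-breaker."""
    digest = hashlib.sha256(
        json.dumps(event, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return (event["at"], digest)


def merge(record, events):
    """Fold events into a state in canonical order, skipping exact duplicates."""
    state = {"record": record, "fields": {}}
    seen = set()
    for event in sorted(events, key=event_key):
        key = event_key(event)
        if key in seen:
            continue
        seen.add(key)
        # Assumed rule: the later event in canonical order wins for a field.
        state["fields"][event["field"]] = event["value"]
    return state


# Made-up events, used only to exercise the sketch.
events = [
    {"at": "2026-05-18T00:00:00Z", "field": "citations", "value": "open"},
    {"at": "2026-05-19T00:00:00Z", "field": "citations", "value": "closed"},
    {"at": "2026-05-18T00:00:00Z", "field": "replications", "value": "open"},
]
record = {"source": {"id": "2311.07575", "kind": "arxiv", "version": 1}}

state_a = merge(record, events)
state_b = merge(record, list(reversed(events)) + events[:1])
print(state_a == state_b)
```

Because the ordering key is computed from the events themselves, two mirrors that hold the same set of events compute the same state regardless of delivery order or duplicates.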

Claims

C1 · strongest claim

Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications.

C2 · weakest assumption

Directly integrating weights from LLMs trained on real-world and synthetic data is assumed to incorporate diverse semantics efficiently and robustly, without introducing conflicts or degrading performance.

C3 · one-line summary

SPHINX improves multi-modal LLMs through joint mixing of weights, tasks, and visual embeddings from varied sources to achieve stronger alignment and multi-purpose capabilities.

References

45 extracted · 45 resolved · 22 Pith anchors

[1] Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond · arXiv:2308.12966

[2] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language Models are Few-Shot Learners

[3] MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning · arXiv:2310.09478

[4] InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning · arXiv:2305.06500

[5] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding · arXiv:1810.04805

Formal links

2 machine-checked theorem links

Cited by

24 papers in Pith

Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems

FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

Deep Pre-Alignment for VLMs

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

Receipt and verification

First computed	2026-05-17T23:38:15.321821Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

677d19fb7d76e89b2ebe208afbc52374fbff19730c04592807ecbb5291149738
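The page does not state how the identifier is derived, but for this record the Pith Number coincides with the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. The sketch below checks that observation; the helper name `base32_prefix` and the length of 26 are assumptions drawn from this one record, not a documented scheme.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "677d19fb7d76e89b2ebe208afbc52374fbff19730c04592807ecbb5291149738"
PITH_NUMBER = "M56RT635O3UJWLV6ECFPXRJDOT"


def base32_prefix(hex_digest, length=26):
    """Base32-encode the raw digest bytes and keep the leading characters."""
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


print(base32_prefix(CANONICAL_HASH) == PITH_NUMBER)  # prints True
```

If the relation holds in general, recomputing the canonical hash is enough to check both the hash and the identifier in one step.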

Aliases

arxiv: 2311.07575 · arxiv_version: 2311.07575v1 · doi: 10.48550/arxiv.2311.07575 · pith_short_12: M56RT635O3UJ · pith_short_16: M56RT635O3UJWLV6 · pith_short_8: M56RT635
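In the alias list, each `pith_short_N` value is the first N characters of the full identifier. A small sketch of parsing the line and checking that relation follows; the function names are hypothetical, and the ` · ` and `: ` separators are taken from how the line is printed on this page rather than from any API.

```python
# Alias line and full identifier copied from this page.
ALIASES = (
    "arxiv: 2311.07575 · arxiv_version: 2311.07575v1 · "
    "doi: 10.48550/arxiv.2311.07575 · pith_short_12: M56RT635O3UJ · "
    "pith_short_16: M56RT635O3UJWLV6 · pith_short_8: M56RT635"
)
FULL_ID = "M56RT635O3UJWLV6ECFPXRJDOT"


def parse_aliases(line):
    """Split 'name: value' pairs separated by ' · ' into a dict."""
    pairs = (item.split(": ", 1) for item in line.split(" · "))
    return {name.strip(): value.strip() for name, value in pairs}


def short_aliases_consistent(aliases, full_id):
    """Check that every pith_short_N alias equals the first N characters."""
    for name, value in aliases.items():
        if name.startswith("pith_short_"):
            n = int(name.rsplit("_", 1)[1])
            if value != full_id[:n]:
                return False
    return True


aliases = parse_aliases(ALIASES)
print(short_aliases_consistent(aliases, FULL_ID))  # prints True
```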

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/M56RT635O3UJWLV6ECFPXRJDOT \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 677d19fb7d76e89b2ebe208afbc52374fbff19730c04592807ecbb5291149738

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "d423f6009012c6e415551ba5b524f51d92dd05608cf7355693107cba48281c06",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL",
      "cs.LG"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2023-11-13T18:59:47Z",
    "title_canon_sha256": "264902f5b7ca56be994ab61c7b18762656d7555d64a3e668d98375fb3664e00b"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2311.07575",
    "kind": "arxiv",
    "version": 1
  }
}
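The canonicalization step of the verification command can also be run offline on the record shown above. This is a sketch that mirrors the one-liner's `json.dumps` settings (sorted keys, compact separators, non-ASCII kept as-is); the function name `canonical_sha256` is an assumption, and the printed digest should be compared against the canonical hash listed on this page rather than taken on trust.

```python
import hashlib
import json


def canonical_sha256(record):
    """Serialize with sorted keys and compact separators, then hash as UTF-8."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "d423f6009012c6e415551ba5b524f51d92dd05608cf7355693107cba48281c06",
        "cross_cats_sorted": ["cs.AI", "cs.CL", "cs.LG"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2023-11-13T18:59:47Z",
        "title_canon_sha256": "264902f5b7ca56be994ab61c7b18762656d7555d64a3e668d98375fb3664e00b",
    },
    "schema_version": "1.0",
    "source": {"id": "2311.07575", "kind": "arxiv", "version": 1},
}

# Same content with keys in a different order, to show the digest is stable.
reordered = {
    "source": {"version": 1, "kind": "arxiv", "id": "2311.07575"},
    "schema_version": "1.0",
    "metadata": dict(reversed(list(record["metadata"].items()))),
}

digest = canonical_sha256(record)
print(digest)
print(digest == canonical_sha256(reordered))
```

Sorting keys before hashing is what makes the digest independent of how a mirror happens to order the JSON object, so only a change in content changes the hash.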