{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:M56RT635O3UJWLV6ECFPXRJDOT","short_pith_number":"pith:M56RT635","schema_version":"1.0","canonical_sha256":"677d19fb7d76e89b2ebe208afbc52374fbff19730c04592807ecbb5291149738","source":{"kind":"arxiv","id":"2311.07575","version":1},"attestation_state":"computed","paper":{"title":"SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Mixing weights from real-world and synthetic LLMs with varied tasks and visual embeddings produces a single versatile multi-modal model.","cross_cats":["cs.AI","cs.CL","cs.LG"],"primary_cat":"cs.CV","authors_text":"Chen Lin, Chris Liu, Han Qiu, Han Xiao, Hongsheng Li, Jiaming Han, Keqin Chen, Longtian Qiu, Peng Gao, Renrui Zhang, Siyuan Huang, Wenqi Shao, Xuming He, Yichi Zhang, Yu Qiao, Ziyi Lin","submitted_at":"2023-11-13T18:59:47Z","abstract_excerpt":"We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning, and "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2311.07575","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2023-11-13T18:59:47Z","cross_cats_sorted":["cs.AI","cs.CL","cs.LG"],"title_canon_sha256":"264902f5b7ca56be994ab61c7b18762656d7555d64a3e668d98375fb3664e00b","abstract_canon_sha256":"d423f6009012c6e415551ba5b524f51d92dd05608cf7355693107cba48281c06"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:15.322497Z","signature_b64":"U9oBUu6Ptn0MC7adC8w2DZcykhGrTgX2KleI76/ltcDwi5gPohB4tGlXAmC4pl7utYIrajEIPtVvs6vwOOE8Bg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"677d19fb7d76e89b2ebe208afbc52374fbff19730c04592807ecbb5291149738","last_reissued_at":"2026-05-17T23:38:15.321821Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:15.321821Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"SPHINX: The Joint Mixing of Weights, Tasks, and Visual Embeddings for Multi-modal Large Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Mixing weights from real-world and synthetic LLMs with varied tasks and visual embeddings produces a single versatile multi-modal model.","cross_cats":["cs.AI","cs.CL","cs.LG"],"primary_cat":"cs.CV","authors_text":"Chen Lin, Chris Liu, Han Qiu, Han Xiao, Hongsheng Li, Jiaming Han, Keqin Chen, Longtian Qiu, Peng Gao, Renrui Zhang, Siyuan Huang, Wenqi Shao, Xuming He, Yichi Zhang, Yu Qiao, Ziyi Lin","submitted_at":"2023-11-13T18:59:47Z","abstract_excerpt":"We present SPHINX, a versatile multi-modal large language model (MLLM) with a joint mixing of model weights, tuning tasks, and visual embeddings. First, for stronger vision-language alignment, we unfreeze the large language model (LLM) during pre-training, and introduce a weight mix strategy between LLMs trained by real-world and synthetic data. By directly integrating the weights from two domains, the mixed LLM can efficiently incorporate diverse semantics with favorable robustness. Then, to enable multi-purpose capabilities, we mix a variety of tasks for joint visual instruction tuning, and "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that directly integrating weights from LLMs trained on real-world and synthetic data will efficiently incorporate diverse semantics with favorable robustness without introducing conflicts or degrading performance.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SPHINX improves multi-modal LLMs through joint mixing of weights, tasks, and visual embeddings from varied sources to achieve stronger alignment and multi-purpose capabilities.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Mixing weights from real-world and synthetic LLMs with varied tasks and visual embeddings produces a single versatile multi-modal model.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"afe52fe8762e8dc4f201c84a3da32db1823ce490ae268e58a691dba7f7026e0e"},"source":{"id":"2311.07575","kind":"arxiv","version":1},"verdict":{"id":"82f64f73-d45f-4971-b152-2ea38f5c8154","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T02:58:20.956133Z","strongest_claim":"Based on our proposed joint mixing, SPHINX exhibits superior multi-modal understanding capabilities on a wide range of applications.","one_line_summary":"SPHINX improves multi-modal LLMs through joint mixing of weights, tasks, and visual embeddings from varied sources to achieve stronger alignment and multi-purpose capabilities.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that directly integrating weights from LLMs trained on real-world and synthetic data will efficiently incorporate diverse semantics with favorable robustness without introducing conflicts or degrading performance.","pith_extraction_headline":"Mixing weights from real-world and synthetic LLMs with varied tasks and visual embeddings produces a single versatile multi-modal model."},"references":{"count":45,"sample":[{"doi":"","year":null,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":1,"cited_arxiv_id":"2308.12966","is_internal_anchor":true},{"doi":"","year":1901,"title":"Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al","work_id":"c78cbfc8-8ead-4365-9b3c-098dabd131d4","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning","work_id":"fb62cd1b-3991-40be-a987-3cfa5772b5b5","ref_index":3,"cited_arxiv_id":"2310.09478","is_internal_anchor":true},{"doi":"","year":null,"title":"InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning","work_id":"f3aac728-ded0-4e55-aa9e-4a1635d4313d","ref_index":4,"cited_arxiv_id":"2305.06500","is_internal_anchor":true},{"doi":"","year":null,"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","ref_index":5,"cited_arxiv_id":"1810.04805","is_internal_anchor":true}],"resolved_work":45,"snapshot_sha256":"0d11f8d387a29d6782b03abd127b414dbeca1775174c43280f8d00cb952cefba","internal_anchors":22},"formal_canon":{"evidence_count":2,"snapshot_sha256":"59ad857b0f1fae70083af3cc60fdce53ce3fa971177ccf32280f6a86c72f43b9"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2311.07575","created_at":"2026-05-17T23:38:15.321946+00:00"},{"alias_kind":"arxiv_version","alias_value":"2311.07575v1","created_at":"2026-05-17T23:38:15.321946+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2311.07575","created_at":"2026-05-17T23:38:15.321946+00:00"},{"alias_kind":"pith_short_12","alias_value":"M56RT635O3UJ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"M56RT635O3UJWLV6","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"M56RT635","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":24,"internal_anchor_count":24,"sample":[{"citing_arxiv_id":"2410.14702","citing_title":"Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2503.16549","citing_title":"MathFlow: Enhancing the Perceptual Flow of MLLMs for Visual Mathematical Problems","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2504.09925","citing_title":"FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15300","citing_title":"Deep Pre-Alignment for VLMs","ref_index":160,"is_internal_anchor":true},{"citing_arxiv_id":"2510.21122","citing_title":"NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2407.03320","citing_title":"InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output","ref_index":80,"is_internal_anchor":true},{"citing_arxiv_id":"2403.00476","citing_title":"TempCompass: Do Video LLMs Really Understand Videos?","ref_index":102,"is_internal_anchor":true},{"citing_arxiv_id":"2403.14624","citing_title":"MathVerse: Does Your Multi-modal LLM Truly See the Diagrams in Visual Math Problems?","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2410.10594","citing_title":"VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2403.09611","citing_title":"MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training","ref_index":73,"is_internal_anchor":true},{"citing_arxiv_id":"2306.13549","citing_title":"A Survey on Multimodal Large Language Models","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2404.14396","citing_title":"SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2311.03079","citing_title":"CogVLM: Visual Expert for Pretrained Language Models","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2311.16502","citing_title":"MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2303.16199","citing_title":"LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention","ref_index":111,"is_internal_anchor":true},{"citing_arxiv_id":"2404.16821","citing_title":"How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites","ref_index":57,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24559","citing_title":"Aligned Multi-View Scripts for Universal Chart-to-Code Generation","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2506.01844","citing_title":"SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2404.18930","citing_title":"Hallucination of Multimodal Large Language Models: A Survey","ref_index":110,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10219","citing_title":"Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2407.07895","citing_title":"LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2306.13394","citing_title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18562","citing_title":"AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation","ref_index":191,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21079","citing_title":"Foveated Reasoning: Stateful, Action-based Visual Focusing for Vision-Language Models","ref_index":24,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/M56RT635O3UJWLV6ECFPXRJDOT","json":"https://pith.science/pith/M56RT635O3UJWLV6ECFPXRJDOT.json","graph_json":"https://pith.science/api/pith-number/M56RT635O3UJWLV6ECFPXRJDOT/graph.json","events_json":"https://pith.science/api/pith-number/M56RT635O3UJWLV6ECFPXRJDOT/events.json","paper":"https://pith.science/paper/M56RT635"},"agent_actions":{"view_html":"https://pith.science/pith/M56RT635O3UJWLV6ECFPXRJDOT","download_json":"https://pith.science/pith/M56RT635O3UJWLV6ECFPXRJDOT.json","view_paper":"https://pith.science/paper/M56RT635","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2311.07575&json=true","fetch_graph":"https://pith.science/api/pith-number/M56RT635O3UJWLV6ECFPXRJDOT/graph.json","fetch_events":"https://pith.science/api/pith-number/M56RT635O3UJWLV6ECFPXRJDOT/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/M56RT635O3UJWLV6ECFPXRJDOT/action/timestamp_anchor","attest_storage":"https://pith.science/pith/M56RT635O3UJWLV6ECFPXRJDOT/action/storage_attestation","attest_author":"https://pith.science/pith/M56RT635O3UJWLV6ECFPXRJDOT/action/author_attestation","sign_citation":"https://pith.science/pith/M56RT635O3UJWLV6ECFPXRJDOT/action/citation_signature","submit_replication":"https://pith.science/pith/M56RT635O3UJWLV6ECFPXRJDOT/action/replication_record"}},"created_at":"2026-05-17T23:38:15.321946+00:00","updated_at":"2026-05-17T23:38:15.321946+00:00"}