{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:4HJNTCXO2WO2MWAA5L4UQHONPO","short_pith_number":"pith:4HJNTCXO","schema_version":"1.0","canonical_sha256":"e1d2d98aeed59da65800eaf9481dcd7bb07925b3977e9f4e3f155cb13bd9a37e","source":{"kind":"arxiv","id":"2404.14396","version":2},"attestation_state":"computed","paper":{"title":"SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation","license":"http://creativecommons.org/licenses/by/4.0/","headline":"SEED-X is a single multimodal model that comprehends arbitrary-sized images and generates at multiple levels of detail.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Chen Li, Jinguo Zhu, Kun Yi, Lin Song, Sijie Zhao, Xiaohan Ding, Ying Shan, Yixiao Ge, Yuying Ge","submitted_at":"2024-04-22T17:56:09Z","abstract_excerpt":"The rapid evolution of multimodal foundation model has demonstrated significant progresses in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However, there remains a gap between its capability and the real-world applicability, primarily due to the model's limited capacity to effectively respond to various user instructions and interact with diverse visual data. In this work, we focus on bridging this gap through integrating two enhanced features: (1) comprehending images of arbitrary sizes and ratios, and (2) enabling multi-granularity image generation. We pr"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2404.14396","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2024-04-22T17:56:09Z","cross_cats_sorted":[],"title_canon_sha256":"aa793a029a6f4e2269cf1bf47ecffec1d1eae171983b0e7fb6412ef03ddb001a","abstract_canon_sha256":"9c00008e7c80a94f887a8c3de96e0de64c122983d342180947e64610740221f4"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:49.884214Z","signature_b64":"MNXL2IM7lPNGU5CXDlo+IDfVr64qRxorVimP8l1i5MgC/+VYXzCma0LpIbCu7JTDi1Uapn/of8xsecbnAblUDw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"e1d2d98aeed59da65800eaf9481dcd7bb07925b3977e9f4e3f155cb13bd9a37e","last_reissued_at":"2026-05-17T23:38:49.883568Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:49.883568Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation","license":"http://creativecommons.org/licenses/by/4.0/","headline":"SEED-X is a single multimodal model that comprehends arbitrary-sized images and generates at multiple levels of detail.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Chen Li, Jinguo Zhu, Kun Yi, Lin Song, Sijie Zhao, Xiaohan Ding, Ying Shan, Yixiao Ge, Yuying Ge","submitted_at":"2024-04-22T17:56:09Z","abstract_excerpt":"The rapid evolution of multimodal foundation model has demonstrated significant progresses in vision-language understanding and generation, e.g., our previous work SEED-LLaMA. However, there remains a gap between its capability and the real-world applicability, primarily due to the model's limited capacity to effectively respond to various user instructions and interact with diverse visual data. In this work, we focus on bridging this gap through integrating two enhanced features: (1) comprehending images of arbitrary sizes and ratios, and (2) enabling multi-granularity image generation. We pr"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We present a unified and versatile foundation model, namely, SEED-X, which is able to model multi-granularity visual semantics for comprehension and generation tasks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That integrating arbitrary-size image comprehension and multi-granularity generation will close the gap between current model capabilities and real-world applicability, assuming successful instruction tuning preserves performance without introducing new limitations.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"SEED-X is a single multimodal model that comprehends arbitrary-sized images and generates at multiple levels of detail.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"8700383e5079f95aef792fb755d3985aa75af1d085e83f9033d9375482034419"},"source":{"id":"2404.14396","kind":"arxiv","version":2},"verdict":{"id":"c2651623-20f4-422c-8e77-fd11ed33b692","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T22:45:38.031890Z","strongest_claim":"We present a unified and versatile foundation model, namely, SEED-X, which is able to model multi-granularity visual semantics for comprehension and generation tasks.","one_line_summary":"SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That integrating arbitrary-size image comprehension and multi-granularity generation will close the gap between current model capabilities and real-world applicability, assuming successful instruction tuning preserves performance without introducing new limitations.","pith_extraction_headline":"SEED-X is a single multimodal model that comprehends arbitrary-sized images and generates at multiple levels of detail."},"references":{"count":76,"sample":[{"doi":"","year":2023,"title":"Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models","work_id":"0717b0f5-1407-4005-9f21-4e2907f265d7","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models","work_id":"a7e3a737-e007-42bc-be89-c4d34c5ee071","ref_index":2,"cited_arxiv_id":"2304.10592","is_internal_anchor":true},{"doi":"","year":2023,"title":"Visual Instruction Tuning","work_id":"68be622d-a6dc-4a13-82de-e3054a3dc509","ref_index":3,"cited_arxiv_id":"2304.08485","is_internal_anchor":true},{"doi":"","year":2023,"title":"Kosmos-2: Grounding Multimodal Large Language Models to the World","work_id":"46e7f9e9-24c6-49af-b7d5-96159fa6f443","ref_index":4,"cited_arxiv_id":"2306.14824","is_internal_anchor":true},{"doi":"","year":2023,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":5,"cited_arxiv_id":"2308.12966","is_internal_anchor":true}],"resolved_work":76,"snapshot_sha256":"ce315cfe1628555e5f7c5854a6e24f3e58657e368f64b7b625e7d03c77fd72a6","internal_anchors":29},"formal_canon":{"evidence_count":1,"snapshot_sha256":"a2e65a65d39cdd3268ebc2434f43f263d61c09f1db0638bc7c6f491b98c61c67"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2404.14396","created_at":"2026-05-17T23:38:49.883677+00:00"},{"alias_kind":"arxiv_version","alias_value":"2404.14396v2","created_at":"2026-05-17T23:38:49.883677+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2404.14396","created_at":"2026-05-17T23:38:49.883677+00:00"},{"alias_kind":"pith_short_12","alias_value":"4HJNTCXO2WO2","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"4HJNTCXO2WO2MWAA","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"4HJNTCXO","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":40,"internal_anchor_count":40,"sample":[{"citing_arxiv_id":"2503.14324","citing_title":"DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.22344","citing_title":"Bernini: Latent Semantic Planning for Video Diffusion","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2601.01593","citing_title":"Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2601.21798","citing_title":"CG-MLLM: Captioning and Generating 3D content via Multi-modal Large Language Models","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18678","citing_title":"Lance: Unified Multimodal Modeling by Multi-Task Synergy","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15735","citing_title":"UAM: A Dual-Stream Perspective on Forgetting in VLA Training","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15792","citing_title":"Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18115","citing_title":"WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18678","citing_title":"Lance: Unified Multimodal Modeling by Multi-Task Synergy","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18052","citing_title":"Efficient 3D Content Reconstruction and Generation","ref_index":72,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18714","citing_title":"Semantic Generative Tuning for Unified Multimodal Models","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19227","citing_title":"Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16961","citing_title":"Latent Action Control for Reasoning-Guided Unified Image Generation","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23606","citing_title":"Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2509.21912","citing_title":"Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2510.21122","citing_title":"NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2510.26583","citing_title":"Emu3.5: Native Multimodal Models are World Learners","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2511.08480","citing_title":"Compressing then Matching: An Efficient Pre-training Paradigm for Multimodal Embedding","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2505.05472","citing_title":"Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2512.00993","citing_title":"PhotoFramer: Multi-modal Image Composition Instruction","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2503.07536","citing_title":"LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2601.15507","citing_title":"A Unified and Controllable Framework for Layered Image Generation with Visual Effects","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2410.13848","citing_title":"Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2503.10631","citing_title":"HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model","ref_index":64,"is_internal_anchor":true},{"citing_arxiv_id":"2503.07265","citing_title":"WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation","ref_index":10,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/4HJNTCXO2WO2MWAA5L4UQHONPO","json":"https://pith.science/pith/4HJNTCXO2WO2MWAA5L4UQHONPO.json","graph_json":"https://pith.science/api/pith-number/4HJNTCXO2WO2MWAA5L4UQHONPO/graph.json","events_json":"https://pith.science/api/pith-number/4HJNTCXO2WO2MWAA5L4UQHONPO/events.json","paper":"https://pith.science/paper/4HJNTCXO"},"agent_actions":{"view_html":"https://pith.science/pith/4HJNTCXO2WO2MWAA5L4UQHONPO","download_json":"https://pith.science/pith/4HJNTCXO2WO2MWAA5L4UQHONPO.json","view_paper":"https://pith.science/paper/4HJNTCXO","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2404.14396&json=true","fetch_graph":"https://pith.science/api/pith-number/4HJNTCXO2WO2MWAA5L4UQHONPO/graph.json","fetch_events":"https://pith.science/api/pith-number/4HJNTCXO2WO2MWAA5L4UQHONPO/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/4HJNTCXO2WO2MWAA5L4UQHONPO/action/timestamp_anchor","attest_storage":"https://pith.science/pith/4HJNTCXO2WO2MWAA5L4UQHONPO/action/storage_attestation","attest_author":"https://pith.science/pith/4HJNTCXO2WO2MWAA5L4UQHONPO/action/author_attestation","sign_citation":"https://pith.science/pith/4HJNTCXO2WO2MWAA5L4UQHONPO/action/citation_signature","submit_replication":"https://pith.science/pith/4HJNTCXO2WO2MWAA5L4UQHONPO/action/replication_record"}},"created_at":"2026-05-17T23:38:49.883677+00:00","updated_at":"2026-05-17T23:38:49.883677+00:00"}