{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:4Q5XUAZ334OCLVGJRTDYNWGGXY","short_pith_number":"pith:4Q5XUAZ3","schema_version":"1.0","canonical_sha256":"e43b7a033bdf1c25d4c98cc786d8c6be3f0a249cee13c37b6adeec1632aa6689","source":{"kind":"arxiv","id":"2410.13848","version":1},"attestation_state":"computed","paper":{"title":"Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Decoupling the visual encoder into separate pathways lets a single transformer handle both multimodal understanding and generation without performance trade-offs.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.CV","authors_text":"Chengyue Wu, Chong Ruan, Ping Luo, Wen Liu, Xiaokang Chen, Xingchao Liu, Xingkai Yu, Yiyang Ma, Zhenda Xie, Zhiyu Wu, Zizheng Pan","submitted_at":"2024-10-17T17:58:37Z","abstract_excerpt":"In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only all"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2410.13848","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2024-10-17T17:58:37Z","cross_cats_sorted":["cs.AI","cs.CL"],"title_canon_sha256":"5fdd467fda2d0c8f5f6d4b978c5f95603eb588043dd172c2860ea2b4301e3512","abstract_canon_sha256":"ccac9affd22a03abd24c36a6f1dbfd031a9ff1e4871892b274524fd558580680"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:49.992425Z","signature_b64":"DBZHmxwTahZ5s6Lem4K5IbEgQP+Yj2BynNUvm5kXkKENtAY+juRJ0QSG7jzqEeKtvGhvguoXKyqgZyLeLKGKBQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"e43b7a033bdf1c25d4c98cc786d8c6be3f0a249cee13c37b6adeec1632aa6689","last_reissued_at":"2026-05-17T23:38:49.991947Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:49.991947Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Decoupling the visual encoder into separate pathways lets a single transformer handle both multimodal understanding and generation without performance trade-offs.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.CV","authors_text":"Chengyue Wu, Chong Ruan, Ping Luo, Wen Liu, Xiaokang Chen, Xingchao Liu, Xingkai Yu, Yiyang Ma, Zhenda Xie, Zhiyu Wu, Zizheng Pan","submitted_at":"2024-10-17T17:58:37Z","abstract_excerpt":"In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only all"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Experiments show that Janus surpasses previous unified model and matches or exceeds the performance of task-specific models.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the conflict arising from differing information granularity in understanding versus generation is the main performance bottleneck and that decoupling the encoders will resolve it without introducing new integration problems in the shared transformer.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Decoupling the visual encoder into separate pathways lets a single transformer handle both multimodal understanding and generation without performance trade-offs.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"0a66796c58d4c14dc82e2dcafb1bb013f32a08214434d4518f56713452f97e99"},"source":{"id":"2410.13848","kind":"arxiv","version":1},"verdict":{"id":"b9051b6c-71f6-4519-adc3-0409a76051bf","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T22:06:37.001843Z","strongest_claim":"Experiments show that Janus surpasses previous unified model and matches or exceeds the performance of task-specific models.","one_line_summary":"Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the conflict arising from differing information granularity in understanding versus generation is the main performance bottleneck and that decoupling the encoders will resolve it without introducing new integration problems in the shared transformer.","pith_extraction_headline":"Decoupling the visual encoder into separate pathways lets a single transformer handle both multimodal understanding and generation without performance trade-offs."},"references":{"count":96,"sample":[{"doi":"","year":2023,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":1,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2024,"title":"The claude 3 model family: Opus, sonnet, haiku","work_id":"bf65b1ed-fd04-46c8-820a-913e0eacf201","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":3,"cited_arxiv_id":"2308.12966","is_internal_anchor":true},{"doi":"","year":2023,"title":"arXiv preprint arXiv:2306.16934 (2023)","work_id":"3754e7d4-269d-41a6-b9f3-7c94d2e69e99","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"DeepSeek LLM: Scaling Open-Source Language Models with Longtermism","work_id":"01b10587-025b-499d-8ba3-7a538d24c2d6","ref_index":5,"cited_arxiv_id":"2401.02954","is_internal_anchor":true}],"resolved_work":96,"snapshot_sha256":"f57c3c4a6cfc64e099b1dde944eea67b9c76ea714a5b1b47a689cf71314fb373","internal_anchors":38},"formal_canon":{"evidence_count":3,"snapshot_sha256":"eebe8bb6914d2bce0a691e3d2b4d26c46d98efec80ca32906d4c7efc081bd46a"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2410.13848","created_at":"2026-05-17T23:38:49.992027+00:00"},{"alias_kind":"arxiv_version","alias_value":"2410.13848v1","created_at":"2026-05-17T23:38:49.992027+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2410.13848","created_at":"2026-05-17T23:38:49.992027+00:00"},{"alias_kind":"pith_short_12","alias_value":"4Q5XUAZ334OC","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"4Q5XUAZ334OCLVGJ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"4Q5XUAZ3","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":42,"internal_anchor_count":42,"sample":[{"citing_arxiv_id":"2605.23500","citing_title":"B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2503.14324","citing_title":"DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2504.09844","citing_title":"MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training","ref_index":73,"is_internal_anchor":true},{"citing_arxiv_id":"2504.06925","citing_title":"Are Vision-Language Models Ready for Dietary Assessment? Exploring the Next Frontier in AI-Powered Food Image Recognition","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21611","citing_title":"UniVL: Unified Vision-Language Embedding for Spatially Grounded Contextual Image Generation","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10784","citing_title":"TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20316","citing_title":"FullFlow: Upgrading Text-to-Image Flow Matching Models for Bidirectional Vision--Language Generation","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17766","citing_title":"LatentUMM: Dual Latent Alignment for Unified Multimodal Models","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18714","citing_title":"Semantic Generative Tuning for Unified Multimodal Models","ref_index":69,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18172","citing_title":"Visualizing the Invisible: Generative Visual Grounding Empowers Universal EEG Understanding in MLLMs","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2510.21122","citing_title":"NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2501.00321","citing_title":"OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning","ref_index":150,"is_internal_anchor":true},{"citing_arxiv_id":"2412.14164","citing_title":"MetaMorph: Multimodal Understanding and Generation via Instruction Tuning","ref_index":271,"is_internal_anchor":true},{"citing_arxiv_id":"2505.05472","citing_title":"Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation","ref_index":84,"is_internal_anchor":true},{"citing_arxiv_id":"2511.22663","citing_title":"AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2505.16933","citing_title":"LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning","ref_index":81,"is_internal_anchor":true},{"citing_arxiv_id":"2512.14098","citing_title":"Cornfigurator: Automated Planning for Any-to-Any Multimodal Model Serving","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2503.07536","citing_title":"LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL","ref_index":80,"is_internal_anchor":true},{"citing_arxiv_id":"2509.06951","citing_title":"F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2601.15507","citing_title":"A Unified and Controllable Framework for Layered Image Generation with Visual Effects","ref_index":53,"is_internal_anchor":true},{"citing_arxiv_id":"2602.07064","citing_title":"OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization","ref_index":78,"is_internal_anchor":true},{"citing_arxiv_id":"2503.22020","citing_title":"CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models","ref_index":66,"is_internal_anchor":true},{"citing_arxiv_id":"2503.10631","citing_title":"HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model","ref_index":65,"is_internal_anchor":true},{"citing_arxiv_id":"2503.07265","citing_title":"WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2505.15809","citing_title":"MMaDA: Multimodal Large Diffusion Language Models","ref_index":13,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/4Q5XUAZ334OCLVGJRTDYNWGGXY","json":"https://pith.science/pith/4Q5XUAZ334OCLVGJRTDYNWGGXY.json","graph_json":"https://pith.science/api/pith-number/4Q5XUAZ334OCLVGJRTDYNWGGXY/graph.json","events_json":"https://pith.science/api/pith-number/4Q5XUAZ334OCLVGJRTDYNWGGXY/events.json","paper":"https://pith.science/paper/4Q5XUAZ3"},"agent_actions":{"view_html":"https://pith.science/pith/4Q5XUAZ334OCLVGJRTDYNWGGXY","download_json":"https://pith.science/pith/4Q5XUAZ334OCLVGJRTDYNWGGXY.json","view_paper":"https://pith.science/paper/4Q5XUAZ3","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2410.13848&json=true","fetch_graph":"https://pith.science/api/pith-number/4Q5XUAZ334OCLVGJRTDYNWGGXY/graph.json","fetch_events":"https://pith.science/api/pith-number/4Q5XUAZ334OCLVGJRTDYNWGGXY/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/4Q5XUAZ334OCLVGJRTDYNWGGXY/action/timestamp_anchor","attest_storage":"https://pith.science/pith/4Q5XUAZ334OCLVGJRTDYNWGGXY/action/storage_attestation","attest_author":"https://pith.science/pith/4Q5XUAZ334OCLVGJRTDYNWGGXY/action/author_attestation","sign_citation":"https://pith.science/pith/4Q5XUAZ334OCLVGJRTDYNWGGXY/action/citation_signature","submit_replication":"https://pith.science/pith/4Q5XUAZ334OCLVGJRTDYNWGGXY/action/replication_record"}},"created_at":"2026-05-17T23:38:49.992027+00:00","updated_at":"2026-05-17T23:38:49.992027+00:00"}