{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:TSJHFKADYEHPCHCUL7NT6I62OR","short_pith_number":"pith:TSJHFKAD","schema_version":"1.0","canonical_sha256":"9c9272a803c10ef11c545fdb3f23da74652492389699e99f44a7b54efd11347d","source":{"kind":"arxiv","id":"2503.07265","version":3},"attestation_state":"computed","paper":{"title":"WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation","license":"http://creativecommons.org/licenses/by-sa/4.0/","headline":"Text-to-image models struggle to apply world knowledge in generated images according to a dedicated new benchmark.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.CV","authors_text":"Bin Lin, Bin Zhu, Chaoran Feng, Jiaqi Liao, Kunpeng Ning, Li Yuan, Mengren Zheng, Munan Ning, Peng Jin, Weiyang Jin, Yuwei Niu","submitted_at":"2025-03-10T12:47:53Z","abstract_excerpt":"Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose \\textbf{WISE}, the first benchmark specifically designed for \\textbf{W}orld Knowledge-\\textbf{I}nformed \\textbf{S}emantic \\textbf{E}valuation. WISE moves beyond simple word-pixel mapping by challenging mo"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2503.07265","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by-sa/4.0/","primary_cat":"cs.CV","submitted_at":"2025-03-10T12:47:53Z","cross_cats_sorted":["cs.AI","cs.CL"],"title_canon_sha256":"a44df9da20e1ddbd7d7d568e7762a43b8193cc99caafb92e23778ca9050dfcb9","abstract_canon_sha256":"bf709dab7ee32b19d092e1a7de86f67f5ce30b8b494b9bf09f2aed3e12bc94aa"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.927171Z","signature_b64":"YvJuP/r4b6J3xC9R7wFI+Q6C6eJh/RqwDLtRv5S7isE6+9fte4/j6Ct1yfDEHNO4ptADEyV5apBHQK9aIts1CA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"9c9272a803c10ef11c545fdb3f23da74652492389699e99f44a7b54efd11347d","last_reissued_at":"2026-05-17T23:38:50.926726Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.926726Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation","license":"http://creativecommons.org/licenses/by-sa/4.0/","headline":"Text-to-image models struggle to apply world knowledge in generated images according to a dedicated new benchmark.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.CV","authors_text":"Bin Lin, Bin Zhu, Chaoran Feng, Jiaqi Liao, Kunpeng Ning, Li Yuan, Mengren Zheng, Munan Ning, Peng Jin, Weiyang Jin, Yuwei Niu","submitted_at":"2025-03-10T12:47:53Z","abstract_excerpt":"Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content. However, existing research and evaluation standards predominantly focus on image realism and shallow text-image alignment, lacking a comprehensive assessment of complex semantic understanding and world knowledge integration in text-to-image generation. To address this challenge, we propose \\textbf{WISE}, the first benchmark specifically designed for \\textbf{W}orld Knowledge-\\textbf{I}nformed \\textbf{S}emantic \\textbf{E}valuation. WISE moves beyond simple word-pixel mapping by challenging mo"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the 1000 crafted prompts and 25 subdomains provide unbiased, comprehensive tests of world knowledge integration without post-hoc selection effects or prompt engineering artifacts.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Text-to-image models struggle to apply world knowledge in generated images according to a dedicated new benchmark.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"edadc41a735de8c196995cd2dcdcbd3ff83b849ba526f3985ac59fcc8c849790"},"source":{"id":"2503.07265","kind":"arxiv","version":3},"verdict":{"id":"1adc6059-3f59-4f90-9467-c82a0595476a","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T16:18:34.201092Z","strongest_claim":"our findings reveal significant limitations in their ability to effectively integrate and apply world knowledge during image generation, highlighting critical pathways for enhancing knowledge incorporation and application in next-generation T2I models.","one_line_summary":"Text-to-image models show significant limitations in integrating world knowledge, as measured by the new WISE benchmark and WiScore metric across 20 models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the 1000 crafted prompts and 25 subdomains provide unbiased, comprehensive tests of world knowledge integration without post-hoc selection effects or prompt engineering artifacts.","pith_extraction_headline":"Text-to-image models struggle to apply world knowledge in generated images according to a dedicated new benchmark."},"references":{"count":61,"sample":[{"doi":"","year":2023,"title":"Pixart-alpha: Fast training of diffusion transformer for photorealistic text-to-image synthe- sis, 2023","work_id":"6917650a-4a8f-4f92-88f8-efd7cd5388a6","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset","work_id":"86d896d2-592f-4d9b-938e-dfeb11f9388f","ref_index":2,"cited_arxiv_id":"2505.09568","is_internal_anchor":true},{"doi":"","year":2024,"title":"Next token prediction towards multimodal intelligence: A comprehensive survey","work_id":"5cdd7b0b-6c98-4848-901b-eab0a5c3f9e6","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Generative pretraining from pixels","work_id":"0747dcb4-e395-4187-9b0b-dadb91e6cb2b","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling","work_id":"67d9e391-26d1-459e-ab56-07e60511c886","ref_index":5,"cited_arxiv_id":"2501.17811","is_internal_anchor":true}],"resolved_work":61,"snapshot_sha256":"577510af768e568d8e73cc1b69bd6894d00aa54f407977ae229de95ebe08860b","internal_anchors":23},"formal_canon":{"evidence_count":2,"snapshot_sha256":"94fb97ae272139d7986f88dd4fddf08517d5c0f8c287e222f9e2431bfb1bf164"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2503.07265","created_at":"2026-05-17T23:38:50.926797+00:00"},{"alias_kind":"arxiv_version","alias_value":"2503.07265v3","created_at":"2026-05-17T23:38:50.926797+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2503.07265","created_at":"2026-05-17T23:38:50.926797+00:00"},{"alias_kind":"pith_short_12","alias_value":"TSJHFKADYEHP","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"TSJHFKADYEHPCHCU","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"TSJHFKAD","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":35,"internal_anchor_count":35,"sample":[{"citing_arxiv_id":"2603.28767","citing_title":"Gen-Searcher: Reinforcing Agentic Search for Image Generation","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21487","citing_title":"Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21605","citing_title":"GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21605","citing_title":"GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2510.21583","citing_title":"Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2510.16888","citing_title":"Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10784","citing_title":"TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21487","citing_title":"Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17766","citing_title":"LatentUMM: Dual Latent Alignment for Unified Multimodal Models","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18714","citing_title":"Semantic Generative Tuning for Unified Multimodal Models","ref_index":45,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16789","citing_title":"Accelerating Rectified Flow Models via Trajectory-Aware Caching","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16961","citing_title":"Latent Action Control for Reasoning-Guided Unified Image Generation","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14876","citing_title":"Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2508.20751","citing_title":"Pref-GRPO: Pairwise Preference Reward-based GRPO for Stable Text-to-Image Reinforcement Learning","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2602.01554","citing_title":"InfoTok: Information-Theoretic Regularization for Capacity-Constrained Shared Visual Tokenization in Unified MLLMs","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2512.07584","citing_title":"LongCat-Image Technical Report","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2505.15809","citing_title":"MMaDA: Multimodal Large Diffusion Language Models","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02355","citing_title":"From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2504.06256","citing_title":"Transfer between Modalities with MetaQueries","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2603.28767","citing_title":"Gen-Searcher: Reinforcing Agentic Search for Image Generation","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12500","citing_title":"SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture","ref_index":99,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11400","citing_title":"UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2506.03147","citing_title":"UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2604.28185","citing_title":"Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling","ref_index":57,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25477","citing_title":"DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing","ref_index":41,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/TSJHFKADYEHPCHCUL7NT6I62OR","json":"https://pith.science/pith/TSJHFKADYEHPCHCUL7NT6I62OR.json","graph_json":"https://pith.science/api/pith-number/TSJHFKADYEHPCHCUL7NT6I62OR/graph.json","events_json":"https://pith.science/api/pith-number/TSJHFKADYEHPCHCUL7NT6I62OR/events.json","paper":"https://pith.science/paper/TSJHFKAD"},"agent_actions":{"view_html":"https://pith.science/pith/TSJHFKADYEHPCHCUL7NT6I62OR","download_json":"https://pith.science/pith/TSJHFKADYEHPCHCUL7NT6I62OR.json","view_paper":"https://pith.science/paper/TSJHFKAD","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2503.07265&json=true","fetch_graph":"https://pith.science/api/pith-number/TSJHFKADYEHPCHCUL7NT6I62OR/graph.json","fetch_events":"https://pith.science/api/pith-number/TSJHFKADYEHPCHCUL7NT6I62OR/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/TSJHFKADYEHPCHCUL7NT6I62OR/action/timestamp_anchor","attest_storage":"https://pith.science/pith/TSJHFKADYEHPCHCUL7NT6I62OR/action/storage_attestation","attest_author":"https://pith.science/pith/TSJHFKADYEHPCHCUL7NT6I62OR/action/author_attestation","sign_citation":"https://pith.science/pith/TSJHFKADYEHPCHCUL7NT6I62OR/action/citation_signature","submit_replication":"https://pith.science/pith/TSJHFKADYEHPCHCUL7NT6I62OR/action/replication_record"}},"created_at":"2026-05-17T23:38:50.926797+00:00","updated_at":"2026-05-17T23:38:50.926797+00:00"}