{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:26B4PUA7HOATI4KNCFQ6OEC5U3","short_pith_number":"pith:26B4PUA7","schema_version":"1.0","canonical_sha256":"d783c7d01f3b8134714d1161e7105da6f90b842d6842ffd0c6fb1f0d3a02c1c2","source":{"kind":"arxiv","id":"2503.07703","version":1},"attestation_state":"computed","paper":{"title":"Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Seedream 2.0 uses a self-developed bilingual LLM text encoder to generate high-fidelity images from Chinese or English prompts with accurate cultural nuances.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Fanshi Li, Fei Liu, Guofeng Wu, Jianchao Yang, Jie Wu, Liang Li, Linjie Yang, Lixue Gong, Liyang Liu, Peng Wang, Qi Zhang, Shijia Zhao, Shiqi Sun, Weilin Huang, Wei Liu, Wei Lu, Xiaochen Lian, Xiaoxia Hou, Xin Xia, Xinyu Zhang, Xuefeng Xiao, Xun Wang, Ye Wang, Yichun Shi, Yu Tian, Yuwei Zhang, Zhi Tian, Zhonghua Zhai","submitted_at":"2025-03-10T17:58:33Z","abstract_excerpt":"Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5 and Midjourney, still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual image generation foundation model that excels across diverse dimensions, which adeptly manages text prompt in both Chinese and English, supporting bilingual image generation and text rendering. We devel"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2503.07703","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CV","submitted_at":"2025-03-10T17:58:33Z","cross_cats_sorted":[],"title_canon_sha256":"92d43a42b4d765aa91d70be74acd529bec7900888860943c5bd4fbeb07ee27cf","abstract_canon_sha256":"32a97c3ea5cc16f3df4e2381704c161cc3204d1b3a91a392238e59cbed760f09"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:14.574737Z","signature_b64":"fOG6OA8fj6EY3ceF0nymKDv9cw9eJil7jNjBeQGmZvVRnEnTytdA2nLC/dcriH5xENzdG4uoSNL38E8ke4HLBA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"d783c7d01f3b8134714d1161e7105da6f90b842d6842ffd0c6fb1f0d3a02c1c2","last_reissued_at":"2026-05-17T23:38:14.574025Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:14.574025Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Seedream 2.0 uses a self-developed bilingual LLM text encoder to generate high-fidelity images from Chinese or English prompts with accurate cultural nuances.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Fanshi Li, Fei Liu, Guofeng Wu, Jianchao Yang, Jie Wu, Liang Li, Linjie Yang, Lixue Gong, Liyang Liu, Peng Wang, Qi Zhang, Shijia Zhao, Shiqi Sun, Weilin Huang, Wei Liu, Wei Lu, Xiaochen Lian, Xiaoxia Hou, Xin Xia, Xinyu Zhang, Xuefeng Xiao, Xun Wang, Ye Wang, Yichun Shi, Yu Tian, Yuwei Zhang, Zhi Tian, Zhonghua Zhai","submitted_at":"2025-03-10T17:58:33Z","abstract_excerpt":"Rapid advancement of diffusion models has catalyzed remarkable progress in the field of image generation. However, prevalent models such as Flux, SD3.5 and Midjourney, still grapple with issues like model bias, limited text rendering capabilities, and insufficient understanding of Chinese cultural nuances. To address these limitations, we present Seedream 2.0, a native Chinese-English bilingual image generation foundation model that excels across diverse dimensions, which adeptly manages text prompt in both Chinese and English, supporting bilingual image generation and text rendering. We devel"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Through extensive experimentation, we demonstrate that Seedream 2.0 achieves state-of-the-art performance across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its output with human preferences, as revealed by its outstanding ELO score.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the self-developed bilingual LLM text encoder and the custom data/caption systems allow the model to learn native Chinese knowledge directly from data without introducing new biases or requiring post-hoc fixes that undermine the claimed native performance.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Seedream 2.0 is a native Chinese-English bilingual diffusion model that integrates a self-developed LLM text encoder, Glyph-Aligned ByT5, and Scaled ROPE to reach claimed state-of-the-art results in prompt following, aesthetics, text rendering, and human preference alignment via RLHF.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Seedream 2.0 uses a self-developed bilingual LLM text encoder to generate high-fidelity images from Chinese or English prompts with accurate cultural nuances.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"eb9bd8a5c2affd3a522a9b02c84d4b8d6f2efa9e335fb41c56c40f99bf44e519"},"source":{"id":"2503.07703","kind":"arxiv","version":1},"verdict":{"id":"8d7e920a-efa1-402c-b882-7b4fd604c2b5","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T08:20:40.510849Z","strongest_claim":"Through extensive experimentation, we demonstrate that Seedream 2.0 achieves state-of-the-art performance across multiple aspects, including prompt-following, aesthetics, text rendering, and structural correctness. Furthermore, Seedream 2.0 has been optimized through multiple RLHF iterations to closely align its output with human preferences, as revealed by its outstanding ELO score.","one_line_summary":"Seedream 2.0 is a native Chinese-English bilingual diffusion model that integrates a self-developed LLM text encoder, Glyph-Aligned ByT5, and Scaled ROPE to reach claimed state-of-the-art results in prompt following, aesthetics, text rendering, and human preference alignment via RLHF.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the self-developed bilingual LLM text encoder and the custom data/caption systems allow the model to learn native Chinese knowledge directly from data without introducing new biases or requiring post-hoc fixes that undermine the claimed native performance.","pith_extraction_headline":"Seedream 2.0 uses a self-developed bilingual LLM text encoder to generate high-fidelity images from Chinese or English prompts with accurate cultural nuances."},"references":{"count":44,"sample":[{"doi":"","year":2023,"title":"Training Diffusion Models with Reinforcement Learning","work_id":"67684dda-3930-452a-b91a-36cbb8e2e219","ref_index":1,"cited_arxiv_id":"2305.13301","is_internal_anchor":true},{"doi":"","year":2023,"title":"Instructpix2pix: Learning to follow image editing instructions","work_id":"0e06497a-be4c-4f15-bd2a-d2007dec6596","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Masactrl: Tuning- free mutual self-attention control for consistent image synthesis and editing","work_id":"1c7a9136-44e8-4cf5-8a5d-505230d11c07","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Textdiffuser-2: Unleashing the power of language models for text rendering","work_id":"bc520157-e60c-42f7-bcad-2bf4ba5d5b8d","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Altclip: Altering the lan- guage encoder in clip for extended language capabilities","work_id":"974c0da1-deee-44f7-be9c-d681153a486c","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":44,"snapshot_sha256":"13c72704d755c989300c683e2cc85e6622f858a6391261bad926a8520fa91f08","internal_anchors":8},"formal_canon":{"evidence_count":2,"snapshot_sha256":"fe2a5df5f09cfa1f52a55f942271ba6cdc73e732d14caabaa89992f4b2e7c63d"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2503.07703","created_at":"2026-05-17T23:38:14.574140+00:00"},{"alias_kind":"arxiv_version","alias_value":"2503.07703v1","created_at":"2026-05-17T23:38:14.574140+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2503.07703","created_at":"2026-05-17T23:38:14.574140+00:00"},{"alias_kind":"pith_short_12","alias_value":"26B4PUA7HOAT","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"26B4PUA7HOATI4KN","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"26B4PUA7","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":19,"internal_anchor_count":19,"sample":[{"citing_arxiv_id":"2604.27505","citing_title":"Leveraging Verifier-Based Reinforcement Learning in Image Editing","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2509.01986","citing_title":"Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2505.05472","citing_title":"Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2511.19365","citing_title":"DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2512.07584","citing_title":"LongCat-Image Technical Report","ref_index":2,"is_internal_anchor":true},{"citing_arxiv_id":"2602.02493","citing_title":"PixelGen: Improving Pixel Diffusion with Perceptual Supervision","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2512.13507","citing_title":"Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2603.00918","citing_title":"Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2504.11346","citing_title":"Seedream 3.0 Technical Report","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2509.20427","citing_title":"Seedream 4.0: Toward Next-generation Multimodal Image Generation","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27505","citing_title":"Leveraging Verifier-Based Reinforcement Learning in Image Editing","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10730","citing_title":"Qwen-Image-2.0 Technical Report","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25427","citing_title":"A Systematic Post-Train Framework for Video Generation","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2505.07818","citing_title":"DanceGRPO: Unleashing GRPO on Visual Generation","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2505.05470","citing_title":"Flow-GRPO: Training Flow Matching Models via Online RL","ref_index":58,"is_internal_anchor":true},{"citing_arxiv_id":"2506.09113","citing_title":"Seedance 1.0: Exploring the Boundaries of Video Generation Models","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2508.02324","citing_title":"Qwen-Image Technical Report","ref_index":11,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14148","citing_title":"Seedance 2.0: Advancing Video Generation for World Complexity","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19902","citing_title":"MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings","ref_index":29,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/26B4PUA7HOATI4KNCFQ6OEC5U3","json":"https://pith.science/pith/26B4PUA7HOATI4KNCFQ6OEC5U3.json","graph_json":"https://pith.science/api/pith-number/26B4PUA7HOATI4KNCFQ6OEC5U3/graph.json","events_json":"https://pith.science/api/pith-number/26B4PUA7HOATI4KNCFQ6OEC5U3/events.json","paper":"https://pith.science/paper/26B4PUA7"},"agent_actions":{"view_html":"https://pith.science/pith/26B4PUA7HOATI4KNCFQ6OEC5U3","download_json":"https://pith.science/pith/26B4PUA7HOATI4KNCFQ6OEC5U3.json","view_paper":"https://pith.science/paper/26B4PUA7","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2503.07703&json=true","fetch_graph":"https://pith.science/api/pith-number/26B4PUA7HOATI4KNCFQ6OEC5U3/graph.json","fetch_events":"https://pith.science/api/pith-number/26B4PUA7HOATI4KNCFQ6OEC5U3/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/26B4PUA7HOATI4KNCFQ6OEC5U3/action/timestamp_anchor","attest_storage":"https://pith.science/pith/26B4PUA7HOATI4KNCFQ6OEC5U3/action/storage_attestation","attest_author":"https://pith.science/pith/26B4PUA7HOATI4KNCFQ6OEC5U3/action/author_attestation","sign_citation":"https://pith.science/pith/26B4PUA7HOATI4KNCFQ6OEC5U3/action/citation_signature","submit_replication":"https://pith.science/pith/26B4PUA7HOATI4KNCFQ6OEC5U3/action/replication_record"}},"created_at":"2026-05-17T23:38:14.574140+00:00","updated_at":"2026-05-17T23:38:14.574140+00:00"}