{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:N4CBQLK5SGMGON3BE2FVVZUSML","short_pith_number":"pith:N4CBQLK5","schema_version":"1.0","canonical_sha256":"6f04182d5d9198673761268b5ae69262cff80490a194e1d9a7ebe5dbb5ab19b7","source":{"kind":"arxiv","id":"2405.08748","version":1},"attestation_state":"computed","paper":{"title":"Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Hunyuan-DiT is a diffusion transformer that generates images from Chinese text with state-of-the-art detail through custom architecture and refined data handling.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Chao Zhang, Chen Zhang, Dayou Chen, Di Wang, Dongdong Wang, Jiabin Huang, Jiahao Li, Jiajun He, Jianchen Zhu, Jiangfeng Xiong, Jianwei Zhang, Jianxiang Lu, Jie Jiang, Jie Liu, Jihong Zhang, Jinbao Xue, Kai Liu, Meng Chen, Minbin Huang, Mingtao Chen, Qinglin Lu, Qin Lin, Rongwei Quan, Sihuan Lin, Wei Liu, Weiyan Wang, Wenyue Li, Xiao Xiao, Xiaoxiao Zheng, Xiaoyan Yuan, Xinchi Deng, Xingchao Liu, Yan Chen, Yangyu Tao, Yanxin Long, Yifu Sun, Yingfang Zhang, Yixuan Li, Yong Yang, Yuhong Liu, Yun Li, Zedong Xiao, Zheng Fang, Zhichao Hu, Zhimin Li","submitted_at":"2024-05-14T16:33:25Z","abstract_excerpt":"We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context."},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2405.08748","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2024-05-14T16:33:25Z","cross_cats_sorted":[],"title_canon_sha256":"afbd35cf94824e1120a5343405c22f755b9c1d221d72612be2c8eac55caee103","abstract_canon_sha256":"608279be699ea2ea47c30b3d92f93c2d0c6b6602af234c4513d133a411f0fd80"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:47.513461Z","signature_b64":"CjGNoXBdmsNpCZLOpOGRGBaezz589soNgJO4i3qyUTeLwYD/UCuaN3LeUqN/Cc5fQdVCnjwoW3Bhcyjuvw9RDw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"6f04182d5d9198673761268b5ae69262cff80490a194e1d9a7ebe5dbb5ab19b7","last_reissued_at":"2026-05-17T23:38:47.512816Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:47.512816Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Hunyuan-DiT is a diffusion transformer that generates images from Chinese text with state-of-the-art detail through custom architecture and refined data handling.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Chao Zhang, Chen Zhang, Dayou Chen, Di Wang, Dongdong Wang, Jiabin Huang, Jiahao Li, Jiajun He, Jianchen Zhu, Jiangfeng Xiong, Jianwei Zhang, Jianxiang Lu, Jie Jiang, Jie Liu, Jihong Zhang, Jinbao Xue, Kai Liu, Meng Chen, Minbin Huang, Mingtao Chen, Qinglin Lu, Qin Lin, Rongwei Quan, Sihuan Lin, Wei Liu, Weiyan Wang, Wenyue Li, Xiao Xiao, Xiaoxiao Zheng, Xiaoyan Yuan, Xinchi Deng, Xingchao Liu, Yan Chen, Yangyu Tao, Yanxin Long, Yifu Sun, Yingfang Zhang, Yixuan Li, Yong Yang, Yuhong Liu, Yun Li, Zedong Xiao, Zheng Fang, Zhichao Hu, Zhimin Li","submitted_at":"2024-05-14T16:33:25Z","abstract_excerpt":"We present Hunyuan-DiT, a text-to-image diffusion transformer with fine-grained understanding of both English and Chinese. To construct Hunyuan-DiT, we carefully design the transformer structure, text encoder, and positional encoding. We also build from scratch a whole data pipeline to update and evaluate data for iterative model optimization. For fine-grained language understanding, we train a Multimodal Large Language Model to refine the captions of the images. Finally, Hunyuan-DiT can perform multi-turn multimodal dialogue with users, generating and refining images according to the context."},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The human evaluation protocol with 50+ evaluators fairly measures fine-grained Chinese understanding without bias from prompt selection, evaluator background, or post-hoc comparison choices.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Hunyuan-DiT is a new multi-resolution diffusion transformer that achieves state-of-the-art Chinese text-to-image generation through custom architecture, data pipelines, and multimodal caption refinement.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Hunyuan-DiT is a diffusion transformer that generates images from Chinese text with state-of-the-art detail through custom architecture and refined data handling.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"4fd5a1882387c6555a249aa1ee7185f7424c1eb310a8b50e323bf0da6fb3338c"},"source":{"id":"2405.08748","kind":"arxiv","version":1},"verdict":{"id":"f69b405d-528b-4731-aa5e-0a50c8ef8fef","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T14:54:58.300525Z","strongest_claim":"Through our holistic human evaluation protocol with more than 50 professional human evaluators, Hunyuan-DiT sets a new state-of-the-art in Chinese-to-image generation compared with other open-source models.","one_line_summary":"Hunyuan-DiT is a new multi-resolution diffusion transformer that achieves state-of-the-art Chinese text-to-image generation through custom architecture, data pipelines, and multimodal caption refinement.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The human evaluation protocol with 50+ evaluators fairly measures fine-grained Chinese understanding without bias from prompt selection, evaluator background, or post-hoc comparison choices.","pith_extraction_headline":"Hunyuan-DiT is a diffusion transformer that generates images from Chinese text with state-of-the-art detail through custom architecture and refined data handling."},"references":{"count":41,"sample":[{"doi":"","year":null,"title":"https://www.midjourney.com/home","work_id":"795b1fa1-556e-485b-9d28-16aaf6e1e8dc","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":2,"cited_arxiv_id":"2308.12966","is_internal_anchor":true},{"doi":"","year":2022,"title":"eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers","work_id":"2cd7b629-ab37-4ce5-b51e-aa4d99547468","ref_index":3,"cited_arxiv_id":"2211.01324","is_internal_anchor":true},{"doi":"","year":2023,"title":"All are worth words: A vit backbone for diffusion models","work_id":"89898e1d-a850-4f1b-a8db-ff8620eda7bd","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Improving image generation with better captions","work_id":"aa4e9e1a-4c37-468d-bdb4-412819771b5e","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":41,"snapshot_sha256":"1d5f18ee11c02aa59a1050e54cd7893913e5cd799d7f3fb0e0c1d9e60dcc71ef","internal_anchors":6},"formal_canon":{"evidence_count":1,"snapshot_sha256":"1b8449eae01016e5199bc864ab98f3d346fdccd03c8aff1532eda75880a08b6d"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2405.08748","created_at":"2026-05-17T23:38:47.512907+00:00"},{"alias_kind":"arxiv_version","alias_value":"2405.08748v1","created_at":"2026-05-17T23:38:47.512907+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2405.08748","created_at":"2026-05-17T23:38:47.512907+00:00"},{"alias_kind":"pith_short_12","alias_value":"N4CBQLK5SGMG","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"N4CBQLK5SGMGON3B","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"N4CBQLK5","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":39,"internal_anchor_count":39,"sample":[{"citing_arxiv_id":"2605.21605","citing_title":"GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2605.23381","citing_title":"VDE: Training-Free Accelerating Rectified Flow Model via Velocity Decomposition and Estimation","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2412.00131","citing_title":"Open-Sora Plan: Open-Source Large Video Generation Model","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2503.02537","citing_title":"RectifiedHR: Enable Efficient High-Resolution Synthesis via Energy Rectification","ref_index":31,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21605","citing_title":"GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2510.16888","citing_title":"Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27505","citing_title":"Leveraging Verifier-Based Reinforcement Learning in Image Editing","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18678","citing_title":"Lance: Unified Multimodal Modeling by Multi-Task Synergy","ref_index":63,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21272","citing_title":"MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset","ref_index":57,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21123","citing_title":"Linear-DPO: Linear Direct Preference Optimization for Diffusion and Flow-Matching Generative Models","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15684","citing_title":"ElasticDiT: Efficient Diffusion Transformers via Elastic Architecture and Sparse Attention for High-Resolution Image Generation on Mobile Devices","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2605.17294","citing_title":"HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18115","citing_title":"WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18678","citing_title":"Lance: Unified Multimodal Modeling by Multi-Task Synergy","ref_index":62,"is_internal_anchor":true},{"citing_arxiv_id":"2605.19532","citing_title":"Boosting Text-to-Image Diffusion Models via Core Token Attention-Based Seed Selection","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2502.10248","citing_title":"Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2506.18871","citing_title":"OmniGen2: Towards Instruction-Aligned Multimodal Generation","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2503.07703","citing_title":"Seedream 2.0: A Native Chinese-English Bilingual Image Generation Foundation Model","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2505.05472","citing_title":"Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2511.20645","citing_title":"PixelDiT: Pixel Diffusion Transformers for Image Generation","ref_index":49,"is_internal_anchor":true},{"citing_arxiv_id":"2601.09896","citing_title":"The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2602.07064","citing_title":"OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization","ref_index":77,"is_internal_anchor":true},{"citing_arxiv_id":"2509.23951","citing_title":"HunyuanImage 3.0 Technical Report","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2503.07598","citing_title":"VACE: All-in-One Video Creation and Editing","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2410.10629","citing_title":"SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers","ref_index":11,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/N4CBQLK5SGMGON3BE2FVVZUSML","json":"https://pith.science/pith/N4CBQLK5SGMGON3BE2FVVZUSML.json","graph_json":"https://pith.science/api/pith-number/N4CBQLK5SGMGON3BE2FVVZUSML/graph.json","events_json":"https://pith.science/api/pith-number/N4CBQLK5SGMGON3BE2FVVZUSML/events.json","paper":"https://pith.science/paper/N4CBQLK5"},"agent_actions":{"view_html":"https://pith.science/pith/N4CBQLK5SGMGON3BE2FVVZUSML","download_json":"https://pith.science/pith/N4CBQLK5SGMGON3BE2FVVZUSML.json","view_paper":"https://pith.science/paper/N4CBQLK5","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2405.08748&json=true","fetch_graph":"https://pith.science/api/pith-number/N4CBQLK5SGMGON3BE2FVVZUSML/graph.json","fetch_events":"https://pith.science/api/pith-number/N4CBQLK5SGMGON3BE2FVVZUSML/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/N4CBQLK5SGMGON3BE2FVVZUSML/action/timestamp_anchor","attest_storage":"https://pith.science/pith/N4CBQLK5SGMGON3BE2FVVZUSML/action/storage_attestation","attest_author":"https://pith.science/pith/N4CBQLK5SGMGON3BE2FVVZUSML/action/author_attestation","sign_citation":"https://pith.science/pith/N4CBQLK5SGMGON3BE2FVVZUSML/action/citation_signature","submit_replication":"https://pith.science/pith/N4CBQLK5SGMGON3BE2FVVZUSML/action/replication_record"}},"created_at":"2026-05-17T23:38:47.512907+00:00","updated_at":"2026-05-17T23:38:47.512907+00:00"}