{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:5Z2PM44YNHCKY3LJFWW7V4PNUC","short_pith_number":"pith:5Z2PM44Y","schema_version":"1.0","canonical_sha256":"ee74f6739869c4ac6d692dadfaf1eda0b5f84c54cf5f19f4f32d231f81a4419a","source":{"kind":"arxiv","id":"2504.06256","version":1},"attestation_state":"computed","paper":{"title":"Transfer between Modalities with MetaQueries","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"MetaQueries are learnable queries that transfer knowledge from frozen multimodal LLMs to diffusion models for image generation.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Aashu Singh, Felix Juefei-Xu, Jialiang Wang, Ji Hou, Jiuhai Chen, Kunpeng Li, Saining Xie, Satya Narayan Shukla, Shlok Kumar Mishra, Xichen Pan, Zhiyang Xu, Zhuokai Zhao","submitted_at":"2025-04-08T17:58:47Z","abstract_excerpt":"Unified multimodal models aim to integrate understanding (text output) and generation (pixel output), but aligning these different modalities within a single architecture often demands complex training recipes and careful data balancing. We introduce MetaQueries, a set of learnable queries that act as an efficient interface between autoregressive multimodal LLMs (MLLMs) and diffusion models. MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2504.06256","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2025-04-08T17:58:47Z","cross_cats_sorted":[],"title_canon_sha256":"5327072016c1ace55eba887b15e3cfe7796d3aca394e1fe500d5b3f919810b01","abstract_canon_sha256":"bff4d7bff84bc837c329f0edd5ebfeedcfbe637763f853bfe93cacd4c2c757ab"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:39:21.409230Z","signature_b64":"m9JJHw4N6Bmn9RCxajLTHy5axCOPOgvAVvQLA44BqiQfEetCyVzQn/zbM5tKkr/Q6aEFBTTG6r26Se70u1N3Bg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"ee74f6739869c4ac6d692dadfaf1eda0b5f84c54cf5f19f4f32d231f81a4419a","last_reissued_at":"2026-05-17T23:39:21.408583Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:39:21.408583Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Transfer between Modalities with MetaQueries","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"MetaQueries are learnable queries that transfer knowledge from frozen multimodal LLMs to diffusion models for image generation.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Aashu Singh, Felix Juefei-Xu, Jialiang Wang, Ji Hou, Jiuhai Chen, Kunpeng Li, Saining Xie, Satya Narayan Shukla, Shlok Kumar Mishra, Xichen Pan, Zhiyang Xu, Zhuokai Zhao","submitted_at":"2025-04-08T17:58:47Z","abstract_excerpt":"Unified multimodal models aim to integrate understanding (text output) and generation (pixel output), but aligning these different modalities within a single architecture often demands complex training recipes and careful data balancing. We introduce MetaQueries, a set of learnable queries that act as an efficient interface between autoregressive multimodal LLMs (MLLMs) and diffusion models. MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies training, requiring only paired image-caption data and standard diffusion objectives. Notably, this transfer is effective even when the MLLM backbone remains frozen.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That a set of learnable queries can effectively align and transfer knowledge from MLLM latents to a diffusion decoder using only standard paired image-caption data and diffusion objectives, without requiring complex training recipes or unfreezing the MLLM.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"MetaQueries are learnable queries that transfer knowledge from frozen multimodal LLMs to diffusion models for image generation.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"25c2ddb7fde32d7e035213dd9905a43e16204d8961ea6486b717ed0c6f80cf3b"},"source":{"id":"2504.06256","kind":"arxiv","version":1},"verdict":{"id":"9596d158-638f-42f0-9513-b8f0109ccbee","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T22:45:05.649463Z","strongest_claim":"MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies training, requiring only paired image-caption data and standard diffusion objectives. Notably, this transfer is effective even when the MLLM backbone remains frozen.","one_line_summary":"MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That a set of learnable queries can effectively align and transfer knowledge from MLLM latents to a diffusion decoder using only standard paired image-caption data and diffusion objectives, without requiring complex training recipes or unfreezing the MLLM.","pith_extraction_headline":"MetaQueries are learnable queries that transfer knowledge from frozen multimodal LLMs to diffusion models for image generation."},"references":{"count":21,"sample":[{"doi":"","year":null,"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","ref_index":1,"cited_arxiv_id":"2502.13923","is_internal_anchor":true},{"doi":"","year":null,"title":"Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling","work_id":"67d9e391-26d1-459e-ab56-07e60511c886","ref_index":2,"cited_arxiv_id":"2501.17811","is_internal_anchor":true},{"doi":"","year":null,"title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","work_id":"806d2e73-71b3-4d56-87e0-39d571cc15d6","ref_index":3,"cited_arxiv_id":"2306.13394","is_internal_anchor":true},{"doi":"","year":null,"title":"Planting a seed of vision in large language model","work_id":"a97ecc74-b2ab-4837-bdc1-0a385272b7e9","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation","work_id":"15953092-dd9e-49ae-9f72-e28fc93a6068","ref_index":5,"cited_arxiv_id":"2404.14396","is_internal_anchor":true}],"resolved_work":21,"snapshot_sha256":"c6d969936a2ed975ddab80c337d5f66cfbc06259892285a17c0c2defbb2a97e2","internal_anchors":14},"formal_canon":{"evidence_count":2,"snapshot_sha256":"df52001c03b6d2d4e02b9322a2f7ee6902040618abf4e1a6768eb1b26a75f7d3"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2504.06256","created_at":"2026-05-17T23:39:21.408682+00:00"},{"alias_kind":"arxiv_version","alias_value":"2504.06256v1","created_at":"2026-05-17T23:39:21.408682+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2504.06256","created_at":"2026-05-17T23:39:21.408682+00:00"},{"alias_kind":"pith_short_12","alias_value":"5Z2PM44YNHCK","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"5Z2PM44YNHCKY3LJ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"5Z2PM44Y","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":38,"internal_anchor_count":38,"sample":[{"citing_arxiv_id":"2605.22344","citing_title":"Bernini: Latent Semantic Planning for Video Diffusion","ref_index":54,"is_internal_anchor":true},{"citing_arxiv_id":"2602.18532","citing_title":"VLANeXt: Recipes for Building Strong VLA Models","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18678","citing_title":"Lance: Unified Multimodal Modeling by Multi-Task Synergy","ref_index":90,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20795","citing_title":"What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15792","citing_title":"Reversing the Flow: Generation-to-Understanding Synergy in Large Multimodal Models","ref_index":28,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18678","citing_title":"Lance: Unified Multimodal Modeling by Multi-Task Synergy","ref_index":89,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18714","citing_title":"Semantic Generative Tuning for Unified Multimodal Models","ref_index":47,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16961","citing_title":"Latent Action Control for Reasoning-Guided Unified Image Generation","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23606","citing_title":"Muddit: Liberating Generation Beyond Text-to-Image with a Unified Discrete Diffusion Model","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2506.18871","citing_title":"OmniGen2: Towards Instruction-Aligned Multimodal Generation","ref_index":55,"is_internal_anchor":true},{"citing_arxiv_id":"2509.01986","citing_title":"Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2511.22663","citing_title":"AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model","ref_index":27,"is_internal_anchor":true},{"citing_arxiv_id":"2504.20690","citing_title":"In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer","ref_index":59,"is_internal_anchor":true},{"citing_arxiv_id":"2601.15507","citing_title":"A Unified and Controllable Framework for Layered Image Generation with Visual Effects","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2512.07584","citing_title":"LongCat-Image Technical Report","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2503.07265","citing_title":"WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2601.03233","citing_title":"LTX-2: Efficient Joint Audio-Visual Foundation Model","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2506.15564","citing_title":"Show-o2: Improved Native Unified Multimodal Models","ref_index":83,"is_internal_anchor":true},{"citing_arxiv_id":"2506.03147","citing_title":"UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.03269","citing_title":"RLDX-1 Technical Report","ref_index":85,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13710","citing_title":"SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2505.09568","citing_title":"BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25636","citing_title":"Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2604.24625","citing_title":"Meta-CoT: Enhancing Granularity and Generalization in Image Editing","ref_index":46,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05781","citing_title":"Steering Visual Generation in Unified Multimodal Models with Understanding Supervision","ref_index":36,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/5Z2PM44YNHCKY3LJFWW7V4PNUC","json":"https://pith.science/pith/5Z2PM44YNHCKY3LJFWW7V4PNUC.json","graph_json":"https://pith.science/api/pith-number/5Z2PM44YNHCKY3LJFWW7V4PNUC/graph.json","events_json":"https://pith.science/api/pith-number/5Z2PM44YNHCKY3LJFWW7V4PNUC/events.json","paper":"https://pith.science/paper/5Z2PM44Y"},"agent_actions":{"view_html":"https://pith.science/pith/5Z2PM44YNHCKY3LJFWW7V4PNUC","download_json":"https://pith.science/pith/5Z2PM44YNHCKY3LJFWW7V4PNUC.json","view_paper":"https://pith.science/paper/5Z2PM44Y","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2504.06256&json=true","fetch_graph":"https://pith.science/api/pith-number/5Z2PM44YNHCKY3LJFWW7V4PNUC/graph.json","fetch_events":"https://pith.science/api/pith-number/5Z2PM44YNHCKY3LJFWW7V4PNUC/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/5Z2PM44YNHCKY3LJFWW7V4PNUC/action/timestamp_anchor","attest_storage":"https://pith.science/pith/5Z2PM44YNHCKY3LJFWW7V4PNUC/action/storage_attestation","attest_author":"https://pith.science/pith/5Z2PM44YNHCKY3LJFWW7V4PNUC/action/author_attestation","sign_citation":"https://pith.science/pith/5Z2PM44YNHCKY3LJFWW7V4PNUC/action/citation_signature","submit_replication":"https://pith.science/pith/5Z2PM44YNHCKY3LJFWW7V4PNUC/action/replication_record"}},"created_at":"2026-05-17T23:39:21.408682+00:00","updated_at":"2026-05-17T23:39:21.408682+00:00"}