{"paper":{"title":"Transfer between Modalities with MetaQueries","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"MetaQueries are learnable queries that transfer knowledge from frozen multimodal LLMs to diffusion models for image generation.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Aashu Singh, Felix Juefei-Xu, Jialiang Wang, Ji Hou, Jiuhai Chen, Kunpeng Li, Saining Xie, Satya Narayan Shukla, Shlok Kumar Mishra, Xichen Pan, Zhiyang Xu, Zhuokai Zhao","submitted_at":"2025-04-08T17:58:47Z","abstract_excerpt":"Unified multimodal models aim to integrate understanding (text output) and generation (pixel output), but aligning these different modalities within a single architecture often demands complex training recipes and careful data balancing. We introduce MetaQueries, a set of learnable queries that act as an efficient interface between autoregressive multimodal LLMs (MLLMs) and diffusion models. MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies training, requiring only paired image-caption data and standard diffusion objectives. Notably, this transfer is effective even when the MLLM backbone remains frozen.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That a set of learnable queries can effectively align and transfer knowledge from MLLM latents to a diffusion decoder using only standard paired image-caption data and diffusion objectives, without requiring complex training recipes or unfreezing the MLLM.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"MetaQueries are learnable queries that transfer knowledge from frozen multimodal LLMs to diffusion models for image generation.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"25c2ddb7fde32d7e035213dd9905a43e16204d8961ea6486b717ed0c6f80cf3b"},"source":{"id":"2504.06256","kind":"arxiv","version":1},"verdict":{"id":"9596d158-638f-42f0-9513-b8f0109ccbee","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T22:45:05.649463Z","strongest_claim":"MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies training, requiring only paired image-caption data and standard diffusion objectives. Notably, this transfer is effective even when the MLLM backbone remains frozen.","one_line_summary":"MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That a set of learnable queries can effectively align and transfer knowledge from MLLM latents to a diffusion decoder using only standard paired image-caption data and diffusion objectives, without requiring complex training recipes or unfreezing the MLLM.","pith_extraction_headline":"MetaQueries are learnable queries that transfer knowledge from frozen multimodal LLMs to diffusion models for image generation."},"references":{"count":21,"sample":[{"doi":"","year":null,"title":"Qwen2.5-VL Technical Report","work_id":"69dffacb-bfe8-442d-be86-48624c60426f","ref_index":1,"cited_arxiv_id":"2502.13923","is_internal_anchor":true},{"doi":"","year":null,"title":"Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling","work_id":"67d9e391-26d1-459e-ab56-07e60511c886","ref_index":2,"cited_arxiv_id":"2501.17811","is_internal_anchor":true},{"doi":"","year":null,"title":"MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models","work_id":"806d2e73-71b3-4d56-87e0-39d571cc15d6","ref_index":3,"cited_arxiv_id":"2306.13394","is_internal_anchor":true},{"doi":"","year":null,"title":"Planting a seed of vision in large language model","work_id":"a97ecc74-b2ab-4837-bdc1-0a385272b7e9","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation","work_id":"15953092-dd9e-49ae-9f72-e28fc93a6068","ref_index":5,"cited_arxiv_id":"2404.14396","is_internal_anchor":true}],"resolved_work":21,"snapshot_sha256":"c6d969936a2ed975ddab80c337d5f66cfbc06259892285a17c0c2defbb2a97e2","internal_anchors":14},"formal_canon":{"evidence_count":2,"snapshot_sha256":"df52001c03b6d2d4e02b9322a2f7ee6902040618abf4e1a6768eb1b26a75f7d3"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}