Pith Number

pith:5Z2PM44Y

pith:2025:5Z2PM44YNHCKY3LJFWW7V4PNUC
not attested · not anchored · not stored · refs resolved

Transfer between Modalities with MetaQueries

Aashu Singh, Felix Juefei-Xu, Jialiang Wang, Ji Hou, Jiuhai Chen, Kunpeng Li, Saining Xie, Satya Narayan Shukla, Shlok Kumar Mishra, Xichen Pan, Zhiyang Xu, Zhuokai Zhao

MetaQueries are learnable queries that transfer knowledge from frozen multimodal LLMs to diffusion models for image generation.

arxiv:2504.06256 v1 · 2025-04-08 · cs.CV

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{5Z2PM44YNHCKY3LJFWW7V4PNUC}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

MetaQueries connects the MLLM's latents to the diffusion decoder, enabling knowledge-augmented image generation by leveraging the MLLM's deep understanding and reasoning capabilities. Our method simplifies training, requiring only paired image-caption data and standard diffusion objectives. Notably, this transfer is effective even when the MLLM backbone remains frozen.

C2 · weakest assumption

That a set of learnable queries can effectively align and transfer knowledge from MLLM latents to a diffusion decoder using only standard paired image-caption data and diffusion objectives, without requiring complex training recipes or unfreezing the MLLM.

C3 · one-line summary

MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.

References

21 extracted · 21 resolved · 14 Pith anchors

[1] Qwen2.5-VL Technical Report · arXiv:2502.13923
[2] Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling · arXiv:2501.17811
[3] MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models · arXiv:2306.13394
[4] Planting a seed of vision in large language model
[5] SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation · arXiv:2404.14396

Formal links

2 machine-checked theorem links

Cited by

38 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:39:21.408583Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

ee74f6739869c4ac6d692dadfaf1eda0b5f84c54cf5f19f4f32d231f81a4419a

Aliases

arxiv: 2504.06256 · arxiv_version: 2504.06256v1 · doi: 10.48550/arxiv.2504.06256 · pith_short_12: 5Z2PM44YNHCK · pith_short_16: 5Z2PM44YNHCKY3LJ · pith_short_8: 5Z2PM44Y
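The identifier and its short aliases appear to follow from the canonical hash. Base32-encoding the SHA-256 digest (RFC 4648 alphabet) and keeping the first 26 characters reproduces the Pith Number shown on this page, and the `pith_short_*` aliases are its 8-, 12-, and 16-character prefixes. This is an observation drawn from the values in this record, not a documented rule of the `pith-number/v1.0` schema. The function names below are hypothetical.

```python
import base64

# Canonical hash as printed in this record.
CANONICAL_HASH = "ee74f6739869c4ac6d692dadfaf1eda0b5f84c54cf5f19f4f32d231f81a4419a"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the first `length` characters.

    Assumption: this derivation is inferred from the values on this page,
    not taken from a published specification.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str) -> dict[str, str]:
    """Return the 8-, 12-, and 16-character prefixes listed under Aliases."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


pith = pith_number_from_hash(CANONICAL_HASH)
print(pith)  # 5Z2PM44YNHCKY3LJFWW7V4PNUC
print(short_aliases(pith))
```

If the inferred rule holds for other records, the short aliases can be matched by prefix against the full identifier, and the identifier can be checked against the hash without contacting the service.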
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/5Z2PM44YNHCKY3LJFWW7V4PNUC \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: ee74f6739869c4ac6d692dadfaf1eda0b5f84c54cf5f19f4f32d231f81a4419a
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "bff4d7bff84bc837c329f0edd5ebfeedcfbe637763f853bfe93cacd4c2c757ab",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-04-08T17:58:47Z",
    "title_canon_sha256": "5327072016c1ace55eba887b15e3cfe7796d3aca394e1fe500d5b3f919810b01"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2504.06256",
    "kind": "arxiv",
    "version": 1
  }
}
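
The shell pipeline above can also be written as a short Python script that works offline on the record shown here. The sketch below applies the same canonicalization as the one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding) and hashes the result with SHA-256. The function name `canonical_sha256` is hypothetical. It is also an assumption that the JSON printed on this page is exactly what the API returns, so compare the printed digest with the canonical hash rather than assuming a match.

```python
import hashlib
import json

# Expected digest, copied from the "Canonical hash" field of this record.
EXPECTED = "ee74f6739869c4ac6d692dadfaf1eda0b5f84c54cf5f19f4f32d231f81a4419a"

# Canonical record as displayed on this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "bff4d7bff84bc837c329f0edd5ebfeedcfbe637763f853bfe93cacd4c2c757ab",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2025-04-08T17:58:47Z",
    "title_canon_sha256": "5327072016c1ace55eba887b15e3cfe7796d3aca394e1fe500d5b3f919810b01"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2504.06256",
    "kind": "arxiv",
    "version": 1
  }
}
"""


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and no whitespace, then hash the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
print("matches expected:", digest == EXPECTED)
```

Because the keys are sorted before hashing, the digest does not depend on the key order or indentation of the input JSON, which is what lets a mirror recompute the same value from its own copy of the record.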