pith. sign in

arxiv: 2606.28696 · v1 · pith:TGHW3QMFnew · submitted 2026-06-27 · 💻 cs.AI

COMPASS: Grounding Composition-Intent Guidance in Unified Multimodal Models

Pith reviewed 2026-06-30 10:08 UTC · model grok-4.3

classification 💻 cs.AI
keywords composition intentunified multimodal modelsexpert tokenMoE backbonecomposition-guided generationComp-11 datasetdenoising conditioninglayout control
0
0 comments X

The pith

A shared expert token anchors composition intent across perception and generation in one multimodal model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

COMPASS presents a unified framework that handles both recognizing fine-grained compositional intent in scenes and turning that intent into controllable image generation. It achieves this by minimally injecting composition expertise into an MoE backbone, distilling the result into a single expert token τ_c, and then reusing that token as a global conditioning signal during denoising. The system is trained and evaluated on a new Comp-11 dataset that provides an 11-class taxonomy with reasoning-augmented annotations. A sympathetic reader would care because the approach converts passive analysis of subject placement and scene organization into explicit layout control inside a single model. If correct, it removes the need for separate perception and generation pipelines while improving both category-level understanding and prompt faithfulness.

Core claim

COMPASS is the first unified multimodal framework that grounds composition-intent control in a single system spanning both composition perception and composition-guided generation, with a shared expert token τ_c as the central intent anchor. On the perception side, COMPASS injects composition expertise into an MoE backbone in a minimally invasive manner and distills the inferred intent into τ_c. On the generation side, COMPASS reuses τ_c as a global conditioning signal that steers the denoising trajectory, effectively converting passive composition analysis into explicit layout control. Extensive experiments show that COMPASS substantially improves category-level composition understanding an

What carries the argument

The shared expert token τ_c, which serves as the central intent anchor by receiving distilled composition expertise from perception and then acting as global conditioning for the generation denoising process.

If this is right

  • Substantially improves category-level composition understanding.
  • Delivers more composition-consistent, prompt-faithful generation than strong baselines.
  • Supports systematic instruction-following composition learning and evaluation at scale via the Comp-11 dataset with its 11-class taxonomy.
  • Converts passive composition analysis into explicit layout control by reusing the same token across perception and generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The token reuse mechanism could be tested on other high-level visual intents such as object relations or scene lighting to see whether the same distillation pattern works without new datasets.
  • Minimal changes to existing MoE backbones suggest the method might be applied to retrofit current unified multimodal models for better intent grounding.
  • Expanding the 11-class taxonomy with finer subcategories could produce tokens that enable more precise layout adjustments during generation.
  • Reusing a single distilled token as conditioning may reduce the computational overhead of maintaining separate perception and generation modules in multimodal systems.

Load-bearing premise

That minimally invasive injection of composition expertise into an MoE backbone plus distillation into τ_c will produce a token that can be directly reused as an effective global conditioning signal for denoising without additional training or architectural changes.

What would settle it

An experiment measuring composition consistency metrics on generated images where the model is run once with τ_c conditioning and once without it, showing no statistically significant difference between the two conditions.

Figures

Figures reproduced from arXiv: 2606.28696 by Dandan Zheng, Dong-Ming Yan, Jingdong Chen, Jun Zhou, Mining Tan, Weiming Dong, Weize Quan, Zhihan Chen, Ziqi Zhou.

Figure 1
Figure 1. Figure 1: Performance of the proposed COMPASS: the unified system handles various composition understanding tasks, and further uses a reference image as layout intent to synthesize new content that remains faithful to the text prompt. Abstract. Composition is a high-level visual intent that governs where subjects are placed and how a scene is organized, yet current unified multimodal models remain unreliable at fine… view at source ↗
Figure 2
Figure 2. Figure 2: Visual examples of our Comp-11 taxonomy. soning. (3) We propose COMPASS, featuring a learnable prefix token τc as a shared intent anchor for both understanding and generation tasks, and a self-supervised structural bottlenecking framework that enables reference-guided generation without the need for paired training images. 2 Related Work Composition understanding and optimization. Photographic composi￾tion… view at source ↗
Figure 3
Figure 3. Figure 3: Data construction pipeline of Comp-11. Taxonomy of Composition. We consolidate composition concepts from ex￾isting composition classification datasets KU-PCP [20] and CADB [40] into a unified taxonomy with C=11 commonly used categories, which we adopt as the label set of our dataset. Specifically, the taxonomy includes Rule of Thirds (RoT), Center, Horizontal, Symmetric, Diagonal, Curved, Vertical, Triangl… view at source ↗
Figure 4
Figure 4. Figure 4: Overview of COMPASS. The model first performs composition perception with N new experts (C-MoE) and an expert token τc to summarize layout cues. For composition-guided generation, a structurally bottlenecked (pixelated) reference is used to suppress appearance leakage, while a customized attention mask (demonstrated un￾der the bold “Generation”, where the dark squares represent “allow to attend”, while the… view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of structural bottlenecks for reference-guided generation. Pixelated grayscale references suppress semantic identifiability by disrupting contour-level cues while preserving coarse spatial layout, leading to better layout transfer with reduced content leakage compared to edge- or blur-based alternatives. Concretely, we introduce a structural bottleneck before the model: we trans￾form the referen… view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative comparison on composition-guided generation. Gray annotations under the captions are for intuitive illustration only and are not used as model inputs. composition-blind observers into expert-level aesthetic creators, establishing a generalizable blueprint for domain-specific multimodal intelligence. Acknowledgment. This work was partially supported by the National Natural Science Foundation of … view at source ↗
Figure 7
Figure 7. Figure 7: Typical examples for the 11 composition categories in Comp-11. Rule of Thirds (RoT). The main subject(s) are placed near the intersections or lines of a 3 × 3 grid. Center. The primary subject is centered, with salient mass concentrated around the image center. Horizontal / Vertical. The global structure is dominated by horizontal (resp. vertical) alignments, e.g., horizons or vertical pillars [PITH_FULL_… view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative comparison on multiple-choice composition identification. We color￾code each option for intuitive reading: green = correctly selected, red = incorrectly se￾lected, blue = missed ground-truth option. Compared with strong baselines, COMPASS selects composition techniques more accurately and avoids “hedging” by over-selecting frequent labels. cues are often weakly supported by the image or even mu… view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative comparison on VQA-style compositional analysis. Highlighted tech￾nique words indicate whether the mentioned composition cues match the ground truth (green) or reflect hallucinated/unsupported claims (red). COMPASS produces more grounded and composition-consistent analyses, whereas other LMM/unified baselines often repeat high-frequency concepts independent of the actual layout [PITH_FULL_IMAGE… view at source ↗
Figure 10
Figure 10. Figure 10: Additional qualitative comparison on composition-guided generation. Gray annotations under captions are for intuitive illustration only and are not used as model inputs. extended baseline comparisons, an ablation showing that pixelization alone is insufficient without model-side intent grounding. Additional comparisons against baselines [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Ablation on structural bottlenecking via pixelization. Finetuning a unified model with pixelized references alone does not reliably learn composition-intent trans￾fer, highlighting the necessity of expert-token-based intent grounding and generation￾time conditioning. transfer. This supports our central motivation: after removing semantic appear￾ance cues, the model still needs an explicit composition infe… view at source ↗
Figure 12
Figure 12. Figure 12: Failure cases of COMPASS. Common issues include orientation mismatch under the correct composition type, and occasional misclassification when perspec￾tive/defocus cues dominate perceived structure. E Ethics Considerations Dataset sourcing and licensing. Comp-11 is constructed by aggregating images from public aesthetic benchmarks (AVA [24], TAD66K [12], CADB [40], KU-PCP [20]). We will comply with the or… view at source ↗
read the original abstract

Composition is a high-level visual intent that governs where subjects are placed and how a scene is organized, yet current unified multimodal models remain unreliable at fine-grained composition recognition and struggle to turn such intent into controllable generation. We present COMPASS, the first unified multimodal framework that grounds composition-intent control in a single system spanning both composition perception and composition-guided generation, with a shared expert token $\tau_c$ as the central intent anchor. On the perception side, COMPASS injects composition expertise into an MoE backbone in a minimally invasive manner and distills the inferred intent into $\tau_c$. On the generation side, COMPASS reuses $\tau_c$ as a global conditioning signal that steers the denoising trajectory, effectively converting passive composition analysis into explicit layout control. To support systematic instruction-following composition learning and evaluation at scale, we construct Comp-11, a large-scale dataset with an 11-class taxonomy and reasoning-augmented annotations. Extensive experiments show that COMPASS substantially improves category-level composition understanding and delivers more composition-consistent, prompt-faithful generation than strong baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces COMPASS, a unified multimodal framework that grounds composition-intent control via a shared expert token τ_c. On the perception side, composition expertise is injected minimally invasively into an MoE backbone and distilled into τ_c. On the generation side, τ_c is reused directly as a global conditioning signal to steer denoising for layout control. The work also constructs the Comp-11 dataset (11-class taxonomy with reasoning-augmented annotations) and reports experiments showing gains in category-level composition understanding and more composition-consistent, prompt-faithful generation versus baselines.

Significance. If the central mechanism holds, the work offers a potentially efficient route to add explicit composition control to existing unified multimodal models without heavy architectural overhaul, addressing a noted gap in fine-grained spatial intent handling. The Comp-11 dataset could support more systematic evaluation of composition tasks. The shared-token unification across perception and generation is a distinctive framing that, if substantiated, would strengthen claims of minimal invasiveness.

major comments (1)
  1. [Generation side (abstract and method)] Generation-side description (abstract and corresponding method section): the central claim that τ_c distilled from perception can be directly reused as a global conditioning signal in the denoising trajectory to achieve explicit layout control, without additional training or architectural modifications, is load-bearing for both the 'unified' and 'minimally invasive' framing. Global tokens typically modulate semantics rather than precise spatial arrangements; the manuscript provides no explicit mechanism (e.g., cross-attention modification, spatial encoding, or auxiliary loss) or ablation demonstrating spatial transfer, leaving the weakest assumption unaddressed.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'extensive experiments' is used without naming the specific metrics, baselines, or dataset splits, which reduces immediate assessability of the reported gains.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for identifying the load-bearing assumption on the generation side. We address the comment directly below.

read point-by-point responses
  1. Referee: [Generation side (abstract and method)] Generation-side description (abstract and corresponding method section): the central claim that τ_c distilled from perception can be directly reused as a global conditioning signal in the denoising trajectory to achieve explicit layout control, without additional training or architectural modifications, is load-bearing for both the 'unified' and 'minimally invasive' framing. Global tokens typically modulate semantics rather than precise spatial arrangements; the manuscript provides no explicit mechanism (e.g., cross-attention modification, spatial encoding, or auxiliary loss) or ablation demonstrating spatial transfer, leaving the weakest assumption unaddressed.

    Authors: We agree that the current manuscript description leaves the precise integration of τ_c into the denoising process underspecified. While the abstract states that τ_c is reused as a global conditioning signal, the method section does not detail the conditioning operator, any spatial encoding, or cross-attention modifications, nor does it include an ablation isolating spatial transfer. This weakens the claims of unification and minimal invasiveness. In the revised manuscript we will (1) expand the generation-side method to specify the exact conditioning mechanism and (2) add an ablation study that measures layout consistency with and without τ_c, thereby addressing the referee's concern. revision: yes

Circularity Check

0 steps flagged

No circularity detected; no equations or derivations present

full rationale

The provided abstract and description contain no mathematical derivations, equations, fitted parameters presented as predictions, or self-citations that bear the central claim. The framework is described at a high level (MoE injection, distillation to τ_c, reuse as conditioning) without any chain that reduces a result to its inputs by construction. This matches the default expectation for papers lacking explicit derivation steps, yielding a self-contained presentation with no detectable circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No technical details on parameters, axioms, or new entities are present in the abstract.

pith-pipeline@v0.9.1-grok · 5742 in / 1056 out tokens · 20562 ms · 2026-06-30T10:08:12.925715+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

45 extracted references · 17 canonical work pages · 11 internal anchors

  1. [1]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Achlioptas,P.,Ovsjanikov,M.,Haydarov,K.,Elhoseiny,M.,Guibas,L.J.:Artemis: Affective language for visual art. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 11569–11579 (2021)

  2. [2]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  3. [3]

    arXiv preprint arXiv:2512.21675 (2025)

    Cao, S., Li, J., Li, X., Pu, Y., Zhu, K., Gao, Y., Luo, S., Xin, Y., Qin, Q., Zhou, Y., Chen, X., Zhang, W., Fu, B., Qiao, Y., Liu, Y.: UniPercept: Towards uni- fied perceptual-level image understanding across aesthetics, quality, structure, and texture. arXiv preprint arXiv:2512.21675 (2025)

  4. [4]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Cao, S., Ma, N., Liu, Y., et al.: ArtiMuse: Fine-grained image aesthetics assessment with joint scoring and expert-level understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 15313–15322 (2026)

  5. [5]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team: Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818 (2024) 16 Z. Zhou et al

  6. [6]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-Pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025)

  7. [7]

    In: IEEE Winter Conference on Applications of Computer Vision (WACV) (2017)

    Chen, Y.L., Huang, T.W., Chang, K.H., Tsai, Y.C., Chen, H.T., Chen, B.Y.: Quan- titativeanalysisofautomaticimagecroppingalgorithms:adatasetandcomparative study. In: IEEE Winter Conference on Applications of Computer Vision (WACV) (2017)

  8. [8]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Chung, J., Hyun, S., Heo, J.P.: Style Injection in Diffusion: A training-free ap- proach for adapting large-scale diffusion models for style transfer. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8795–8805 (2024)

  9. [9]

    Emerging Properties in Unified Multimodal Pretraining

    Deng, C., Xie, J., Zhang, D.J., Gu, Y., Mao, W., Bai, Z., Wang, W., Lin, K.Q., Chen, Z., Yang, Z., Shou, M.Z.: Emerging properties in unified multimodal pre- training. arXiv preprint arXiv:2505.14683 (2025)

  10. [10]

    SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

    Ge, Y., Zhao, S., Zhu, J., Ge, Y., Yi, K., Song, L., Li, C., Ding, X., Shan, Y.: SEED-X: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396 (2024)

  11. [11]

    arXiv preprint arXiv:2505.02471 (2025)

    Gong, B., Zou, C., Zheng, D., Yu, H., Chen, J., Sun, J., Zhao, J., Zhou, J., Ji, K., Ru, L., et al.: Ming-Lite-Uni: Advancements in unified architecture for natural multimodal interaction. arXiv preprint arXiv:2505.02471 (2025)

  12. [12]

    In: International Joint Conference on Artificial Intelligence (IJCAI)

    He, S., Zhang, Y., Xie, R., Jiang, D., Ming, A.: Rethinking image aesthetics as- sessment: Models, datasets and benchmarks. In: International Joint Conference on Artificial Intelligence (IJCAI). vol. 6, p. 22 (2022)

  13. [13]

    In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

    Hessel,J.,Holtzman,A.,Forbes,M.,LeBras,R.,Choi,Y.:CLIPScore:Areference- free evaluation metric for image captioning. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. pp. 7514–7528 (2021)

  14. [14]

    In: International Conference on Neural Information Processing Systems (NIPS)

    Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local nash equilibrium. In: International Conference on Neural Information Processing Systems (NIPS). vol. 30 (2017)

  15. [15]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR)

    Hong, C., Du, S., Xian, K., Lu, H., Cao, Z., Zhong, W.: Composing photos like a photographer. In: IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR). pp. 7053–7062 (2021)

  16. [16]

    In: AAAI Conference on Artificial Intelligence (AAAI)

    Hong, J., Yuan, L., Gharbi, M., Fisher, M., Fatahalian, K.: Learning subject-aware cropping by outpainting professional photos. In: AAAI Conference on Artificial Intelligence (AAAI). vol. 38, pp. 2175–2183 (2024)

  17. [17]

    In: International Conference on Learning Representations (ICLR) (2022)

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (ICLR) (2022)

  18. [18]

    In: ACM International Conference on Multimedia

    Huang, Y., Cao, Z., Li, Q., Zhao, C., Gao, C., Yan, Z., Zuo, Y., Yang, F., Cheng, D., Tan, P., et al.: AesExpert: Towards multi-modality foundation model for image aesthetics perception. In: ACM International Conference on Multimedia. pp. 5925– 5936 (2024)

  19. [19]

    arXiv preprint arXiv:2401.08276 (2024)

    Huang, Y., Yuan, Q., Sheng, X., Yang, Z., Wu, H., Chen, P., Yang, Y., Li, L., Lin, W.: AesBench: An expert benchmark for multimodal large language models on image aesthetics perception. arXiv preprint arXiv:2401.08276 (2024)

  20. [20]

    Journal of Visual Communication and Image Representation55, 91–105 (2018) COMPASS 17

    Lee, J.T., Kim, H.U., Lee, C., Kim, C.S.: Photographic composition classification and dominant geometric element detection for outdoor scenes. Journal of Visual Communication and Image Representation55, 91–105 (2018) COMPASS 17

  21. [21]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Li, Y., Liu, H., Wu, Q., Mu, F., Yang, J., Gao, J., Li, C., Lee, Y.J.: GLIGEN: Open- set grounded text-to-image generation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 22511–22521 (2023)

  22. [22]

    Step1X-Edit: A Practical Framework for General Image Editing

    Liu, S., Han, Y., Xing, P., Yin, F., Wang, R., Cheng, W., Liao, J., Wang, Y., Fu, H., Han, C., Li, G., Peng, Y., Sun, Q., Wu, J., Cai, Y., Ge, Z., Ming, R., Xia, L., Zeng, X., Zhu, Y., Jiao, B., Zhang, X., Yu, G., Jiang, D.: Step1X-Edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761 (2025)

  23. [23]

    arXiv preprint arXiv:2411.07975 (2024)

    Ma, Y., Liu, X., Chen, X., Liu, W., Wu, C., Wu, Z., Pan, Z., Xie, Z., Zhang, H., yu, X., Zhao, L., Wang, Y., Liu, J., Ruan, C.: JanusFlow: Harmonizing autore- gression and rectified flow for unified multimodal understanding and generation. arXiv preprint arXiv:2411.07975 (2024)

  24. [24]

    In: IEE Conference on Computer Vision and Pattern Recog- nition (CVPR)

    Murray, N., Marchesotti, L., Perronnin, F.: AVA: A large-scale database for aes- thetic visual analysis. In: IEE Conference on Computer Vision and Pattern Recog- nition (CVPR). pp. 2408–2415 (2012)

  25. [25]

    In: AAAI Conference on Artificial Intelligence (AAAI)

    Su, Y., Cao, Y., Deng, J., Rao, F., Wu, Q.: Spatial-semantic collaborative cropping for user generated content. In: AAAI Conference on Artificial Intelligence (AAAI). pp. 4988 – 4997 (2024)

  26. [26]

    Qwen2.5-VL Technical Report

    Team, Q.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

  27. [27]

    In: IEEE/CVF International Conference on Computer Vision (ICCV)

    Tong, S., Fan, D., Zhu, J., Xiong, Y., Chen, X., Sinha, K., Rabbat, M., LeCun, Y., Xie, S., Liu, Z.: MetaMorph: Multimodal understanding and generation via instruction tuning. In: IEEE/CVF International Conference on Computer Vision (ICCV). pp. 17001–17012 (2025)

  28. [28]

    arXiv preprint arXiv:2404.02733 (2024)

    Wang, H., Wang, Q., Bai, X., Qin, Z., Chen, A.: InstantStyle: Free lunch towards style-preserving in text-to-image generation. arXiv preprint arXiv:2404.02733 (2024)

  29. [29]

    Emu3: Next-Token Prediction is All You Need

    Wang, X., Zhang, Z., Wang, W., et al.: Emu3: Next-token prediction is all you need. arXiv preprint arXiv:2409.18869 (2024)

  30. [30]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Wu, C., Chen, X., Wu, Z., Ma, Y., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C., Luo, P.: Janus: Decoupling visual encoding for unified multimodal understand- ing and generation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 12966–12977 (2025)

  31. [31]

    In: International Conference on Machine Learning (ICML) (2024)

    Wu, S., Fei, H., Qu, L., Ji, W., Chua, T.S.: NExT-GPT: Any-to-any multimodal LLM. In: International Conference on Machine Learning (ICML) (2024)

  32. [32]

    VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation

    Wu, Y., Zhang, Z., Chen, J., Tang, H., Li, D., Fang, Y., Zhu, L., Xie, E., Yin, H., Yi, L., et al.: VILA-U: a unified foundation model integrating visual understanding and generation. arXiv preprint arXiv:2409.04429 (2024)

  33. [33]

    In: International Conference on Learning Representations (ICLR) (2025)

    Xie, J., Mao, W., Bai, Z., Zhang, D.J., Wang, W., Lin, K.Q., Gu, Y., Chen, Z., Yang, Z., Shou, M.Z.: Show-o: One single transformer to unify multimodal under- standing and generation. In: International Conference on Learning Representations (ICLR) (2025)

  34. [34]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

    Xu, R., Xi, W., Wang, X., Mao, Y., Cheng, Z.: Stylessp: Sampling start- point enhancement for training-free diffusion-based method for style transfer. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2025)

  35. [35]

    Com- putational Visual Media8(1), 33–62 (2022)

    Xu, Y., Wei, H., Lin, M., Deng, Y., Sheng, K., Zhang, M., Tang, F., Dong, W., Huang, F., Xu, C.: Transformers in computational visual media: A survey. Com- putational Visual Media8(1), 33–62 (2022)

  36. [36]

    Com- putational Visual Media9(1), 87–107 (2023) 18 Z

    Yang, G.Y., Zhou, W.Y., Cai, Y., Zhang, S.H., Zhang, F.L.: Focusing on your subject: Deep subject-aware image composition recommendation networks. Com- putational Visual Media9(1), 87–107 (2023) 18 Z. Zhou et al

  37. [37]

    IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: IP-Adapter: Text compati- ble image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)

  38. [38]

    10197–10207 (2026)

    You, Z., Wang, K., Zhang, H., Cai, X., Gu, J., Xue, T., Dong, C., Zhang, Z.: PhotoFramer: Multi-modal image composition instruction pp. 10197–10207 (2026)

  39. [39]

    In: IEEE/CVF International Conference on Computer Vision (ICCV)

    Zhang, J., Guo, J., Sun, S., Lou, J.G., Zhang, D.: LayoutDiffusion: Improving graphic layout generation by discrete diffusion probabilistic models. In: IEEE/CVF International Conference on Computer Vision (ICCV). pp. 22490–22499 (2023)

  40. [40]

    Zhang, L., Niu, Y., Liu, W., et al.: Image composition assessment with saliency- augmented multi-pattern pooling (2021)

  41. [41]

    In: IEEE/CVF International Conference on Computer Vision (ICCV)

    Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: IEEE/CVF International Conference on Computer Vision (ICCV). pp. 3836–3847 (2023)

  42. [42]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Zhao, Z., Lu, P., Hu, Y., Zhang, A., Chen, S., Li, P., Li, X., Liu, X., Wang, L., Guo, W.: Can machines understand composition? dataset and benchmark for photographic image composition embedding and understanding. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 14411– 14421 (2025)

  43. [43]

    In: International Conference on Learning Representations (ICLR) (2025)

    Zhou, C., Yu, L., Babu, A., Tirumala, K., Yasunaga, M., Shamis, L., Kahn, J., Ma, X., Zettlemoyer, L., Levy, O.: Transfusion: Predict the next token and diffuse images with one multi-modal model. In: International Conference on Learning Representations (ICLR) (2025)

  44. [44]

    arXiv preprint arXiv:2404.09619 (2024) A Multi-expert Voting Protocol We employ a multi-expert voting protocol to produce consensus aesthetic scores for the source poolO

    Zhou, Z., Wang, Q., Lin, B., Su, Y., Chen, R., Tao, X., Zheng, A., Yuan, L., Wan, P., Zhang, D.: UNIAA: A unified multi-modal image aesthetic assessment baseline and benchmark. arXiv preprint arXiv:2404.09619 (2024)

  45. [45]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Zhu, J., Tian, H., Gao, Z., Liu, Y., Li, H., Chen, D., Jiang, T., et al.: InternVL3: Ex- ploringadvancedtrainingandtest-timerecipesforopen-sourcemultimodalmodels. arXiv preprint arXiv:2504.10479 (2025) COMPASS 19 Appendix This appendix provides additional details that are omitted from the main paper due to space constraints, including: (i) taxonomy defini...