
arxiv: 2606.31924 · v1 · pith:4XGH5EQFnew · submitted 2026-06-30 · 💻 cs.CV

InstanceControl: Controllable Complex Image Generation without Instance Labeling

Pith reviewed 2026-07-01 05:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords controllable image generation · multi-instance scenes · vision-language models · instance masks · adaptive mask refinement · diffusion models · attribute confusion · label-free control

The pith

A vision-language model can automatically match text instance descriptions to regions in visual conditions, enabling precise multi-object image control without manual labeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that controllable generation of complex scenes containing multiple distinct objects can succeed without the usual requirement of hand-labeled instance masks. Existing methods suffer from attribute confusion because they cannot reliably link specific text descriptions to the right parts of inputs such as depth or edge maps. InstanceControl uses a VLM to extract instance descriptions from the prompt and to predict matching masks from the visual conditions, then refines those masks adaptively as generation proceeds. If this works, users could produce detailed, correctly attributed scenes from ordinary prompts and conditions alone.

Core claim

InstanceControl establishes instance-level correspondences by having the VLM parse descriptions from the text prompt and predict instance masks from the visual conditions, then applies adaptive mask refinement during the diffusion process to handle prediction noise, resulting in accurate control over multiple instances in generated images.

What carries the argument

VLM-driven automatic parsing of instance descriptions paired with mask prediction from visual conditions, followed by adaptive refinement of those masks inside the generation loop.
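
The refinement step named above can be pictured with a toy loop. This is only a sketch under stated assumptions: the source does not give the update rule, so the blending rule, the weight schedule, the `evidence_per_step` maps (for example, some per-step localization signal from the generator), and the stopping tolerance `tol` are all hypothetical and are not the authors' formulation.

```python
from typing import List, Tuple

Mask = List[List[float]]  # per-pixel probability in [0, 1]


def blend(mask: Mask, evidence: Mask, weight: float) -> Mask:
    """Move each pixel toward the evidence map by `weight` (hypothetical rule)."""
    return [
        [(1.0 - weight) * m + weight * e for m, e in zip(mrow, erow)]
        for mrow, erow in zip(mask, evidence)
    ]


def max_change(a: Mask, b: Mask) -> float:
    """Largest absolute per-pixel difference between two masks."""
    return max(abs(x - y) for arow, brow in zip(a, b) for x, y in zip(arow, brow))


def refine(pred: Mask, evidence_per_step: List[Mask], tol: float = 1e-3) -> Tuple[Mask, int]:
    """Refine a predicted mask over generation steps; returns (mask, steps used)."""
    mask = pred
    total = len(evidence_per_step)
    steps = 0
    for t, evidence in enumerate(evidence_per_step):
        # Assumed schedule: trust the evidence more at later steps, capped at 0.5.
        weight = 0.5 * (t + 1) / total
        updated = blend(mask, evidence, weight)
        steps = t + 1
        converged = max_change(updated, mask) < tol
        mask = updated
        if converged:  # assumed stopping criterion
            break
    return mask, steps
```

In this sketch a noisy predicted mask drifts toward whatever the per-step evidence supports, and the loop exits early once the mask stops changing. Whether the paper's refinement behaves this way is exactly what the referee report asks the authors to document.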

If this is right

  • Multi-instance scenes can be generated from standard text prompts and visual conditions without any instance-level annotation step.
  • Attribute confusion between objects decreases relative to prior controllable diffusion methods that lack explicit instance association.
  • The same visual condition inputs used by ControlNet-style models become sufficient for precise per-object control once VLM parsing is added.
  • Generation quality measured by fidelity and instance accuracy improves over current state-of-the-art approaches on complex scenes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same parsing-plus-refinement pattern could be tested on video or 3D generation tasks that also require region-text alignment.
  • If future VLMs produce cleaner masks, the adaptive refinement module might be simplified or removed while retaining performance.
  • Downstream applications such as interactive scene editing could adopt the method to let users specify object properties via text alone.

Load-bearing premise

The VLM can reliably parse instance descriptions from prompts and predict accurate instance masks from visual conditions, with the adaptive refinement sufficient to handle any noise in those predictions.

What would settle it

Run the method on a collection of prompts describing scenes with many similar or overlapping objects; if attribute swaps such as color or identity between instances remain frequent in the outputs, the central claim does not hold.
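
One way to make this test countable is an attribute swap rate. The sketch below assumes each instance has one expected attribute value taken from the prompt and one observed value read off the generated image (by human raters or an attribute classifier, which is outside this snippet). The function name and the dictionaries are illustrative, not part of the paper.

```python
from typing import Dict


def attribute_swap_rate(expected: Dict[str, str], observed: Dict[str, str]) -> float:
    """Fraction of instances whose observed attribute was meant for another instance."""
    if not expected:
        return 0.0
    swaps = 0
    for name, want in expected.items():
        got = observed.get(name)
        # Attribute values the prompt assigned to the *other* instances.
        others = {value for key, value in expected.items() if key != name}
        if got != want and got in others:
            swaps += 1
    return swaps / len(expected)


expected = {"cat": "black", "dog": "white", "bird": "red"}
observed = {"cat": "white", "dog": "black", "bird": "red"}
rate = attribute_swap_rate(expected, observed)  # cat and dog traded colors
```

A wrong attribute that matches no other instance (a green bird here) is an error but not a swap, so it is deliberately not counted. A persistently high rate on crowded or overlapping scenes would count against the central claim.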

Figures

Figures reproduced from arXiv: 2606.31924 by Fan Li, Huan Wang, Jiaqi Xu, Ming Liu, Wangmeng Zuo, Xiaoyu Liu, Zhixin Wang.

Figure 1
Figure 1: Our proposed InstanceControl achieves fine-grained control over instance attributes in complex multi-instance scenarios. In contrast, FLUX ControlNet [18] often struggles with attribute confusion. Incorrect instances are marked with red boxes, and the corresponding instance descriptions are also highlighted in the prompt.
Figure 2
Figure 2: The proposed InstanceControl framework operates in two stages: instance-level text-visual condition association and instance-aware controllable generation. In the first stage, a VLM is used to establish instance-level correspondences C between text prompts and visual conditions. To mitigate noise in predicted masks, the second stage introduces a mask refinement module that adjusts $m_i^{\mathrm{pred}}$ to $m_i^{\mathrm{rfn}}$ based o…
Figure 3
Figure 3: Qualitative comparisons in multi-instance scenarios across various image prompts and visual conditions. Our InstanceControl achieves significantly finer attribute control for each instance compared to existing methods. Implementation Details. Our training process consists of two primary stages. In the first stage, we employ the pretrained Sa2VA [59] as our backbone. Low-Rank Adaptation (LoRA) modules, wit…
Figure 4
Figure 4: Visualization of learned correspondences between text prompts and visual conditions, alongside generated images under different instance numbers.
Figure 5
Figure 5: Visualization of the interactive mask correction.
Original abstract

Controllable image generation methods, such as ControlNet, have demonstrated a remarkable capacity to introduce visual conditions(e.g., depth maps) to guide image generation. However, these methods often struggle with complex multi-instance scenes, frequently leading to attribute confusion among instances. While recent approaches attempt to mitigate this via manual instance labeling, such requirements are labor-intensive. In this paper, we propose InstanceControl, a novel multi-instance controllable generation method that eliminates the need for instance labeling. We identify the primary bottleneck in existing methods as the inability to accurately associate instance descriptions with their corresponding regions within visual conditions. To address this, we leverage the Vision-Language Model (VLM) to establish instance-level correspondences between text prompts and visual conditions. Specifically, the VLM automatically parses instance descriptions from the text prompts and simultaneously predicts instance masks based on the visual conditions. Furthermore, since the predicted masks may contain noise, we introduce an adaptive mask refinement strategy that dynamically refines these instance masks during the generation process. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods, achieving superior fidelity and precise instance-level control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes InstanceControl, a method for multi-instance controllable image generation (e.g., using depth or Canny conditions) that avoids manual instance labeling. It relies on a VLM to automatically parse instance descriptions from text prompts and predict corresponding instance masks directly from the visual conditions, followed by an adaptive mask refinement step during the diffusion process. The authors claim this yields superior fidelity and precise instance-level control compared to prior state-of-the-art methods.

Significance. If the core technical claims are substantiated, the work would address a practical limitation in controllable generation for complex scenes by removing labor-intensive labeling. The use of VLMs to establish text-to-condition correspondences is a conceptually interesting direction, and the adaptive refinement is a reasonable engineering response to prediction noise. However, the absence of supporting validation for the VLM step on non-RGB inputs substantially reduces the assessed significance.

major comments (2)
  1. [Abstract] Abstract: The central claim that the VLM 'simultaneously predicts instance masks based on the visual conditions' (depth maps, edges, etc.) is load-bearing for the no-labeling contribution, yet the manuscript provides no quantitative mask-quality metrics, ablation studies, or fine-tuning details demonstrating reliable performance on inputs outside standard VLM RGB training distributions.
  2. [Method] Method description: The adaptive mask refinement strategy is presented as sufficient to correct noise in VLM predictions, but without concrete implementation details, equations, or ablation results quantifying its effect on final instance control accuracy, it is impossible to verify whether misalignment in the initial masks is actually resolved.
minor comments (1)
  1. [Abstract] Abstract: Typo/missing space in 'conditions(e.g., depth maps)'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where additional evidence and detail would strengthen the presentation of our claims. We address each point below and will revise the manuscript to incorporate the requested validation and implementation specifics.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that the VLM 'simultaneously predicts instance masks based on the visual conditions' (depth maps, edges, etc.) is load-bearing for the no-labeling contribution, yet the manuscript provides no quantitative mask-quality metrics, ablation studies, or fine-tuning details demonstrating reliable performance on inputs outside standard VLM RGB training distributions.

    Authors: We agree that the manuscript lacks direct quantitative support for VLM mask prediction on non-RGB inputs. The current evaluation focuses on end-to-end generation quality rather than isolated mask metrics. In the revision we will add a dedicated evaluation subsection reporting mask-quality metrics (e.g., mean IoU against human-annotated references) for depth and Canny conditions on a held-out set, plus an ablation that measures the impact of removing the VLM correspondence step. We will also state the VLM training setup explicitly, namely the pretrained Sa2VA backbone with the LoRA modules described in the implementation details. These additions will directly substantiate the load-bearing claim. revision: yes

  2. Referee: [Method] Method description: The adaptive mask refinement strategy is presented as sufficient to correct noise in VLM predictions, but without concrete implementation details, equations, or ablation results quantifying its effect on final instance control accuracy, it is impossible to verify whether misalignment in the initial masks is actually resolved.

    Authors: We concur that the current description of adaptive mask refinement is insufficiently detailed. The revision will expand the method section with the exact algorithmic procedure, including the mathematical formulation for timestep-dependent mask updating (the probability adjustment rule and the stopping criterion), pseudocode, and a new ablation table that isolates the refinement module's contribution to instance-level metrics such as attribute-binding accuracy and mask alignment error. This will enable verification that initial misalignments are resolved. revision: yes
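
The mask-quality metric promised in the first response, mean IoU against human-annotated references, could be computed along these lines. This is a minimal sketch that assumes binary masks stored as nested lists of 0/1 and one reference mask per predicted mask. The function names are illustrative and the pairing of predictions to references is taken as given.

```python
from typing import Sequence

BinaryMask = Sequence[Sequence[int]]


def iou(pred: BinaryMask, ref: BinaryMask) -> float:
    """Intersection over union of two binary masks with the same shape."""
    inter = 0
    union = 0
    for prow, rrow in zip(pred, ref):
        for p, r in zip(prow, rrow):
            inter += 1 if (p and r) else 0
            union += 1 if (p or r) else 0
    # Two empty masks agree perfectly by convention.
    return inter / union if union else 1.0


def mean_iou(preds: Sequence[BinaryMask], refs: Sequence[BinaryMask]) -> float:
    """Average IoU over paired predicted and reference masks."""
    if len(preds) != len(refs):
        raise ValueError("preds and refs must have the same length")
    if not preds:
        raise ValueError("at least one mask pair is required")
    return sum(iou(p, r) for p, r in zip(preds, refs)) / len(preds)


pred_masks = [[[1, 1], [0, 0]], [[0, 0], [1, 1]]]
ref_masks = [[[1, 0], [0, 0]], [[0, 0], [1, 1]]]
score = mean_iou(pred_masks, ref_masks)
```

Reporting this separately for depth and Canny inputs would show whether mask prediction holds up outside RGB images, which is the gap the first major comment identifies.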

Circularity Check

0 steps flagged

No circularity; empirical method with external VLM component

Full rationale

The paper describes an empirical pipeline that invokes an off-the-shelf VLM to parse instance descriptions and predict masks from control signals, then applies adaptive refinement. No equations, fitted parameters, or derivations are present. No self-citations are invoked as load-bearing uniqueness theorems, and no input is renamed as a prediction. The central claim therefore rests on the (unverified here) performance of the external VLM rather than reducing to a self-referential definition or fit.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract; this is an applied machine learning method relying on existing VLM capabilities.

pith-pipeline@v0.9.1-grok · 5734 in / 974 out tokens · 23764 ms · 2026-07-01T05:35:55.222589+00:00 · methodology


Reference graph

Works this paper leans on

69 extracted references · 30 canonical work pages · 14 internal anchors

  1. [1]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023)

  2. [2]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

  3. [3]

    arXiv preprint arXiv:2312.03079 (2023)

    Bhat, S.F., Mitra, N.J., Wonka, P.: Loosecontrol: Lifting controlnet for generalized depth conditioning. arXiv preprint arXiv:2312.03079 (2023)

  4. [4]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)

  5. [5]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24185–24198 (2024)

  6. [6]

    Advances in neural information processing systems 37, 128886–128910 (2024)

    Cheng, B., Ma, Y., Wu, L., Liu, S., Ma, A., Wu, X., Leng, D., Yin, Y.: Hico: Hierarchical controllable diffusion model for layout-to-image generation. Advances in neural information processing systems 37, 128886–128910 (2024)

  7. [7]

    In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

    Choi, H., Kasahara, I., Engin, S., Graule, M.A., Chavan-Dafle, N., Isler, V.: Finecontrolnet: Fine-level text control for image generation with spatially aligned text control injection. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 3975–3984. IEEE (2025)

  8. [8]

    arXiv preprint arXiv:2502.10451 (2025)

    Fang, Z., Xiang, L., Cai, X., Zhou, K., Wen, H.: Flexcontrol: Computation-aware controlnet with differentiable router for text-to-image generation. arXiv preprint arXiv:2502.10451 (2025)

  9. [9]

    Google: Nano banana. https://gemini.google/overview/image-generation (2025)

  10. [10]

    Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

    Gou, B., Wang, R., Zheng, B., Xie, Y., Chang, C., Shu, Y., Sun, H., Su, Y.: Navigating the digital world as humans do: Universal visual grounding for gui agents. arXiv preprint arXiv:2410.05243 (2024)

  11. [11]

    Advances in neural information processing systems 30 (2017)

    Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems 30 (2017)

  12. [12]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Hu, L.: Animate anyone: Consistent and controllable image-to-video synthesis for character animation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8153–8163 (2024)

  13. [13]

    arXiv preprint arXiv:2306.00964 (2023)

    Hu, M., Zheng, J., Liu, D., Zheng, C., Wang, C., Tao, D., Cham, T.J.: Cocktail: Mixing multi-modality controls for text-conditional image generation. arXiv preprint arXiv:2306.00964 (2023)

  14. [14]

    InstantX: Qwen-image-controlnet-union. https://huggingface.co/InstantX/Qwen-Image-ControlNet-Union (2025)

  15. [15]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4015–4026 (2023)

  16. [16]

    Advances in neural information processing systems 36, 36652–36663 (2023)

    Kirstain, Y., Polyak, A., Singer, U., Matiana, S., Penna, J., Levy, O.: Pick-a-pic: An open dataset of user preferences for text-to-image generation. Advances in neural information processing systems 36, 36652–36663 (2023)

  17. [17]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Koley, S., Bhunia, A.K., Sekhri, D., Sain, A., Chowdhury, P.N., Xiang, T., Song, Y.Z.: It’s all about your sketch: Democratising sketch control in diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7204–7214 (2024)

  18. [18]

    Labs, B.F.: Flux. https://github.com/black-forest-labs/flux (2024)

  19. [19]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning segmentation via large language model. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9579–9589 (2024)

  20. [20]

    arXiv preprint arXiv:2509.19282 (2025)

    Li, B., Wang, C.Y., Xu, H., Zhang, X., Armand, E., Srivastava, D., Shan, X., Chen, Z., Xie, J., Tu, Z.: Overlaybench: A benchmark for layout-to-image generation with dense overlaps. arXiv preprint arXiv:2509.19282 (2025)

  21. [21]

    arXiv preprint arXiv:2506.00596 (2025)

    Li, D., Zhang, H., Wang, S., Li, J., Wu, Z.: Seg2any: Open-set segmentation-mask-to-image generation with precise shape and semantic control. arXiv preprint arXiv:2506.00596 (2025)

  22. [22]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Li, Z., Yang, B., Liu, Q., Zhang, S., Ma, Z., Yin, L., Deng, L., Sun, Y., Liu, Y., Bai, X.: Lira: Inferring segmentation in large multi-modal models with local interleaved region assistance. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 24056–24067 (2025)

  23. [23]

    UniWorld-V1: High-Resolution Semantic Encoders for Unified Visual Understanding and Generation

    Lin, B., Li, Z., Cheng, X., Niu, Y., Ye, Y., He, X., Yuan, S., Yu, W., Wang, S., Ge, Y., et al.: Uniworld-v1: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147 (2025)

  24. [24]

    In: Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems

    Lin, D.C.E., Kang, H.B., Martelaro, N., Kittur, A., Chen, Y.Y., Hong, M.K.: Inkspire: supporting design exploration with generative ai through analogical sketching. In: Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. pp. 1–18 (2025)

  25. [25]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Lin, K.Q., Li, L., Gao, D., Yang, Z., Wu, S., Bai, Z., Lei, S.W., Wang, L., Shou, M.Z.: Showui: One vision-language-action model for gui visual agent. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 19498–19508 (2025)

  26. [26]

    In: European conference on computer vision

    Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)

  27. [27]

    In: NeurIPS (2023)

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)

  28. [28]

    In: European conference on computer vision

    Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al.: Grounding dino: Marrying dino with grounded pre-training for open-set object detection. In: European conference on computer vision. pp. 38–55. Springer (2024)

  29. [29]

    In: European Conference on Computer Vision

    Liu, X., Wei, Y., Liu, M., Lin, X., Ren, P., Xie, X., Zuo, W.: Smartcontrol: Enhancing controlnet for handling rough visual conditions. In: European Conference on Computer Vision. pp. 1–17. Springer (2024)

  30. [30]

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Liu, Y., Peng, B., Zhong, Z., Yue, Z., Lu, F., Yu, B., Jia, J.: Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520 (2025)

  31. [31]

    Decoupled Weight Decay Regularization

    Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  32. [32]

    Controlnext: Powerful and efficient control for image and video generation

    Peng, B., Wang, J., Zhang, Y., Li, W., Yang, M.C., Jia, J.: Controlnext: Powerful and efficient control for image and video generation. arXiv preprint arXiv:2408.06070 (2024)

  33. [33]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Wei, F.: Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023)

  34. [34]

    In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

    Pi, R., Gao, J., Diao, S., Pan, R., Dong, H., Zhang, J., Yao, L., Han, J., Xu, H., Kong, L., et al.: Detgpt: Detect what you need via reasoning. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. pp. 14172–14189 (2023)

  35. [35]

    In: The Twelfth International Conference on Learning Representations

    Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., Rombach, R.: Sdxl: Improving latent diffusion models for high-resolution image synthesis. In: The Twelfth International Conference on Learning Representations

  36. [36]

    arXiv preprint arXiv:2305.11147 (2023)

    Qin, C., Zhang, S., Yu, N., Feng, Y., Yang, X., Zhou, Y., Wang, H., Niebles, J.C., Xiong, C., Savarese, S., et al.: Unicontrol: A unified diffusion model for controllable visual generation in the wild. arXiv preprint arXiv:2305.11147 (2023)

  37. [37]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

  38. [38]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Rasheed, H., Maaz, M., Shaji, S., Shaker, A., Khan, S., Cholakkal, H., Anwer, R.M., Xing, E., Yang, M.H., Khan, F.S.: Glamm: Pixel grounding large multimodal model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13009–13018 (2024)

  39. [39]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Ren, Z., Huang, Z., Wei, Y., Zhao, Y., Fu, D., Feng, J., Jin, X.: Pixellm: Pixel reasoning with large multimodal model. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26374–26383 (2024)

  40. [40]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

  41. [41]

    arXiv preprint arXiv:2511.18333 (2025)

    Shi, X., Li, B., Han, X., Cai, Z., Yang, L., Lin, D., Wang, Q.: Consistcompose: Unified multimodal layout control for image composition. arXiv preprint arXiv:2511.18333 (2025)

  42. [42]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Shin, C., Choi, J., Kim, H., Yoon, S.: Large-scale text-to-image model with inpainting is a zero-shot subject-driven image generator. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 7986–7996 (2025)

  43. [43]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Tan, Z., Liu, S., Yang, X., Xue, Q., Wang, X.: Ominicontrol: Minimal and universal control for diffusion transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 14940–14950 (2025)

  44. [44]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., Lin, J.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)

  45. [45]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Wang, T., Cheng, C., Wang, L., Chen, S., Zhao, W.: Himtok: Learning hierarchical mask tokens for image segmentation with large multimodal model. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23267–23278 (2025)

  46. [46]

    arXiv preprint arXiv:2404.08506 (2024)

    Wei, C., Tan, H., Zhong, Y., Yang, Y., Ma, L.: Lasagna: Language-based segmentation assistant for complex queries. arXiv preprint arXiv:2404.08506 (2024)

  47. [47]

    arXiv preprint arXiv:2411.17606 (2024)

    Wei, C., Zhong, Y., Tan, H., Liu, Y., Zhao, Z., Hu, J., Yang, Y.: Hyperseg: Towards universal visual segmentation with large language model. arXiv preprint arXiv:2411.17606 (2024)

  48. [48]

    Qwen-Image Technical Report

    Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025)

  49. [49]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Wu, Z., Chen, X., Pan, Z., Liu, X., Liu, W., Dai, D., Gao, H., Ma, Y., Wu, C., Wang, B., et al.: Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302 (2024)

  50. [50]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Xia, Z., Han, D., Han, Y., Pan, X., Song, S., Huang, G.: Gsva: Generalized segmentation via multimodal large language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3858–3869 (2024)

  51. [51]

    arXiv preprint arXiv:2505.05071 (2025)

    Xie, C., Wang, B., Kong, F., Li, J., Liang, D., Zhang, G., Leng, D., Yin, Y.: Fg-clip: Fine-grained visual and textual alignment. arXiv preprint arXiv:2505.05071 (2025)

  52. [52]

    In: Proceedings of the IEEE international conference on computer vision

    Xie, S., Tu, Z.: Holistically-nested edge detection. In: Proceedings of the IEEE international conference on computer vision. pp. 1395–1403 (2015)

  53. [53]

    Advances in Neural Information Processing Systems 36, 15903–15935 (2023)

    Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., Dong, Y.: Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems 36, 15903–15935 (2023)

  54. [54]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Yang, J., Yang, S., Gupta, A.W., Han, R., Fei-Fei, L., Xie, S.: Thinking in space: How multimodal large language models see, remember, and recall spaces. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 10632–10643 (2025)

  55. [55]

    Lisa++: An improved baseline for reasoning segmentation with large language model

    Yang, S., Qu, T., Lai, X., Tian, Z., Peng, B., Liu, S., Jia, J.: Lisa++: An improved baseline for reasoning segmentation with large language model. arXiv preprint arXiv:2312.17240 (2023)

  56. [56]

    Yang, Y., He, X., Pan, H., Jiang, X., Deng, Y., Yang, X., Lu, H., Yin, D., Rao, F., Zhu, M., et al.: R1-onevision: Advancing generalized multimodal reasoning through cross-modal formalization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2376–2385 (2025)

  57. [57]

    Ferret: Refer and Ground Anything Anywhere at Any Granularity

    You, H., Zhang, H., Gan, Z., Du, X., Zhang, B., Wang, Z., Cao, L., Chang, S.F., Yang, Y.: Ferret: Refer and ground anything anywhere at any granularity. arXiv preprint arXiv:2310.07704 (2023)

  58. [58]

    arXiv preprint arXiv:2506.22624 (2025)

    You, Z., Wu, Z.: Seg-r1: Segmentation can be surprisingly simple with reinforcement learning. arXiv preprint arXiv:2506.22624 (2025)

  59. [59]

    Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

    Yuan, H., Li, X., Zhang, T., Sun, Y., Huang, Z., Xu, S., Ji, S., Tong, Y., Qi, L., Feng, J., et al.: Sa2va: Marrying sam2 with llava for dense grounded understanding of images and videos. arXiv preprint arXiv:2501.04001 (2025)

  60. [60]

    arXiv preprint arXiv:2312.06573 (2023)

    Zavadski, D., Feiden, J.F., Rother, C.: Controlnet-xs: Designing an efficient and effective architecture for controlling text-to-image diffusion models. arXiv preprint arXiv:2312.06573 (2023)

  61. [61]

    In: Proceedings of the 7th ACM International Conference on Multimedia in Asia

    Zhang, H., Duan, Z., Wang, X., Chen, Y., Zhang, Y.: Eligen: Entity-level controlled image generation with regional attention. In: Proceedings of the 7th ACM International Conference on Multimedia in Asia. pp. 1–7 (2025)

  62. [62]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Zhang, H., Hong, D., Wang, Y., Shao, J., Wu, X., Wu, Z., Jiang, Y.G.: Creatilayout: Siamese multimodal diffusion transformer for creative layout-to-image generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 18487–18497 (2025)

  63. [63]

    arXiv preprint arXiv:2505.19114 (2025)

    Zhang, H., Hong, D., Yang, M., Cheng, Y., Zhang, Z., Shao, J., Wu, X., Wu, Z., Jiang, Y.G.: Creatidesign: A unified multi-conditional diffusion transformer for creative graphic design. arXiv preprint arXiv:2505.19114 (2025)

  64. [64]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3836–3847 (2023)

  65. [65]

    Advances in neural information processing systems 37, 71737–71767 (2024)

    Zhang, T., Li, X., Fei, H., Yuan, H., Wu, S., Ji, S., Loy, C.C., Yan, S.: Omg-llava: Bridging image-level, object-level, pixel-level reasoning and understanding. Advances in neural information processing systems 37, 71737–71767 (2024)

  66. [66]

    In: European Conference on Computer Vision

    Zhang, Z., Ma, Y., Zhang, E., Bai, X.: Psalm: Pixelwise segmentation with large multi-modal model. In: European Conference on Computer Vision. pp. 74–91. Springer (2024)

  67. [67]

    arXiv preprint arXiv:2503.12885 (2025)

    Zhou, D., Li, M., Yang, Z., Yang, Y.: Dreamrenderer: Taming multi-instance attribute control in large-scale text-to-image models. arXiv preprint arXiv:2503.12885 (2025)

  68. [68]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)

  69. [69]

    In: European Conference on Computer Vision (ECCV) (2024)

    Zhu, S., Chen, J.L., Dai, Z., Xu, Y., Cao, X., Yao, Y., Zhu, H., Zhu, S.: Champ: Controllable and consistent human image animation with 3d parametric guidance. In: European Conference on Computer Vision (ECCV) (2024)