arxiv: 2606.05031 · v1 · pith:OPPN4AK6new · submitted 2026-06-03 · 💻 cs.CV

MetaPoint: Unlocking Precise Spatial Control in Agentic Visual Generation

Pith reviewed 2026-06-28 06:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords MetaPoint · spatial control · generative visual models · positional encoding · agentic generation · compositional tokens · pixel-level control · bounding box

The pith

MetaPoint represents any 2D coordinate as a single special token that generative models interpret as a virtual point on the image canvas through their existing positional encodings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generative visual models can follow textual descriptions of space but cannot directly translate numerical coordinates into exact placements on the 2D canvas. MetaPoint bridges the gap by encoding each continuous coordinate as one special token that the model treats as a point using its built-in positional encodings. This requires no new architecture, training, or attention masking. The tokens are compositional, so a planner agent can break high-level requests into sequences of these primitives for the generator. A reader would care because the approach promises pixel-level object positioning with one token and bounding-box control with two.

Core claim

The paper claims that MetaPoint bridges the disconnect between textual spatial descriptions and numerical coordinates by representing each continuous 2D position as a single special token. The token leverages the model's inherent positional encoding to act as a virtual point on the canvas. This enables one token for pixel-level object positioning and two for bounding boxes. The tokens are compositional spatial primitives that allow planner agents to structure high-level requests into sequences for the generator, all without architectural modifications.

What carries the argument

The MetaPoint token, a special token that represents a continuous 2D coordinate and is interpreted as a virtual point on the canvas via the model's positional encoding schemes.
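
The abstract gives no equations, so the following is only an illustration of why the premise is plausible in principle, not the authors' implementation. It relies on one known fact: the sinusoidal encoding from "Attention is all you need" is a closed-form function of position, so it can be evaluated at a non-integer coordinate. The function names, the default dimension, and the choice to give half the channels to x and half to y are all assumptions made for this sketch.

```python
import math


def sinusoidal_encoding(pos: float, dim: int, base: float = 10000.0) -> list[float]:
    """Standard sinusoidal encoding, evaluated at a real-valued position."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    out = []
    for i in range(dim // 2):
        freq = base ** (-2.0 * i / dim)
        out.append(math.sin(pos * freq))  # even channel
        out.append(math.cos(pos * freq))  # odd channel
    return out


def encode_point(x: float, y: float, dim: int = 8) -> list[float]:
    """Hypothetical 2D layout: first half of the channels for x, second half for y."""
    half = dim // 2
    return sinusoidal_encoding(x, half) + sinusoidal_encoding(y, half)


if __name__ == "__main__":
    # A fractional x-coordinate gets its own encoding, distinct from its integer neighbours.
    print(encode_point(12.5, 3.0) != encode_point(12.0, 3.0))
```

Whether a pretrained generator actually treats such an off-grid encoding as a meaningful canvas location is exactly the open question the premise rests on; the sketch shows only that the encoding itself is well defined there.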

If this is right

  • Pixel-level control of an object's position is possible with one MetaPoint token.
  • A bounding box is specified with two MetaPoint tokens.
  • A planner agent can decompose a high-level user request into a structured sequence of these spatial primitives.
  • The approach supports intuitive interactive editing systems without model retraining.
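
The bullets above can be pictured as a small data structure a planner might emit. This is a minimal sketch under stated assumptions: the `MetaPoint` class is a stand-in for the special token, and `place_at`, `place_in_box`, and `compose` are hypothetical helper names. The abstract does not specify the real token format or the planner interface.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class MetaPoint:
    """Placeholder for one special token carrying a continuous (x, y)."""
    x: float
    y: float


Primitive = Union[str, MetaPoint]


def place_at(label: str, x: float, y: float) -> list[Primitive]:
    """One token: position an object at a single point."""
    return [label, MetaPoint(x, y)]


def place_in_box(label: str, x0: float, y0: float, x1: float, y1: float) -> list[Primitive]:
    """Two tokens: top-left and bottom-right corners of a bounding box."""
    if x0 >= x1 or y0 >= y1:
        raise ValueError("expected top-left corner before bottom-right corner")
    return [label, MetaPoint(x0, y0), MetaPoint(x1, y1)]


def compose(*parts: list[Primitive]) -> list[Primitive]:
    """Concatenate primitives into one structured sequence for the generator."""
    seq: list[Primitive] = []
    for part in parts:
        seq.extend(part)
    return seq


def count_points(seq: list[Primitive]) -> int:
    """Number of MetaPoint placeholders in a sequence."""
    return sum(isinstance(p, MetaPoint) for p in seq)


if __name__ == "__main__":
    plan = compose(
        place_at("cat", 128.0, 256.0),
        place_in_box("sofa", 40.0, 300.0, 480.0, 500.0),
    )
    print(len(plan), count_points(plan))
```

The point of the flat sequence is that each spatial constraint costs a fixed, small number of tokens, which is what makes the primitives easy for a planner to chain.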

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token mechanism could be tested for temporal control in video generation if positional encodings extend across frames.
  • Interactive interfaces might let users directly manipulate MetaPoint tokens in a canvas to refine generated outputs in real time.
  • Compositional use might allow agents to handle overlapping or relative spatial arrangements beyond isolated positions and boxes.

Load-bearing premise

The model's inherent positional encoding schemes can interpret the introduced MetaPoint tokens as virtual points on the canvas without any architectural changes, additional training, or bespoke attention masking.

What would settle it

Prompt a model with a MetaPoint token for a precise pixel coordinate such as (128, 256) and check whether the generated object appears exactly at that location rather than shifted, ignored, or placed according to text alone.
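
One way to score such a test, sketched here under assumptions: an external detector (not specified by the paper) returns a bounding box `(x0, y0, x1, y1)` for the generated object, and placement quality is measured as the pixel distance between the requested point and the box centre, or as intersection-over-union when a box was requested. The function names are hypothetical.

```python
import math

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def center_error(target: tuple[float, float], detected: Box) -> float:
    """Euclidean distance in pixels from the requested point to the detected box centre."""
    cx = (detected[0] + detected[2]) / 2.0
    cy = (detected[1] + detected[3]) / 2.0
    return math.hypot(cx - target[0], cy - target[1])


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Requested (128, 256); a detection centred at (131, 260) is 5 px away.
    print(center_error((128.0, 256.0), (121.0, 252.0, 141.0, 268.0)))
```

A claim of pixel-level control would predict centre errors near zero across many coordinates and seeds; errors that track the text prompt rather than the coordinate would count against it.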

Original abstract

Generative visual models fundamentally struggle with precise spatial control. This arises from a core disconnect: models can process textual descriptions of space but cannot directly map numerical coordinates onto the 2D image canvas. We introduce MetaPoint, a method that bridges this gap by representing a continuous 2D coordinate as a single, special token. Crucially, MetaPoint requires no new architectural components; it directly leverages the model's inherent positional encoding schemes to interpret these coordinates, treating our token as a virtual point on the canvas. This lightweight approach enables pixel-level control of an object's position with one token or its bounding box with two, all without requiring architectural changes or bespoke attention masking. The MetaPoint tokens are designed to be compositional, serving as spatial primitives. This allows a planner agent to decompose a high-level user request into a structured sequence of primitives for the generator. By providing a simple, precise, and scalable building block for spatial control, MetaPoint unlocks more powerful compositional generative agents and enables intuitive, interactive editing systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MetaPoint, a method that represents a continuous 2D coordinate as a single special token in generative visual models. It claims this token is interpreted by the model's existing positional encoding schemes as a virtual point on the canvas, enabling pixel-level control of object position with one token or bounding box with two tokens, without architectural changes, additional training, or bespoke attention masking. The tokens are compositional spatial primitives intended for use by planner agents in decomposing high-level requests for the generator.

Significance. If the central claim holds, MetaPoint would supply a lightweight, parameter-free building block for precise spatial control in visual generation. This could meaningfully advance agentic systems by allowing structured decomposition of spatial instructions and support interactive editing without retraining or complex masking schemes.

major comments (2)
  1. [Abstract] The central claim that MetaPoint tokens enable pixel-level control by directly leveraging inherent positional encodings without architectural changes is unsupported; no equations, derivations, or implementation details are supplied to show how an arbitrary continuous coordinate is injected into standard discrete positional encodings (sinusoidal, rotary, or learned) while remaining a single token.
  2. [Abstract] The assumption that the model's positional encoding can natively decode MetaPoint tokens as virtual continuous points on the 2D canvas is load-bearing for the 'no changes' guarantee, yet the text provides no mechanism, ablation, or empirical verification that this mapping preserves precision or works across model families.
minor comments (1)
  1. The manuscript contains no quantitative results, error analysis, or comparisons, which limits assessment of the claimed pixel-level precision.
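
To make the precision concern in major comment 1 concrete, here is a small arithmetic sketch. It assumes a hypothetical model whose positions index a grid of 16-pixel patches, and a coordinate that is snapped to the centre of the patch containing it. Neither the patch size nor the snapping rule comes from the paper; they serve only to show how much precision a purely discrete position index would lose.

```python
def snap_to_patch_center(v: float, patch: int) -> float:
    """Centre of the patch that contains coordinate v (hypothetical snapping rule)."""
    index = int(v // patch)
    return index * patch + patch / 2.0


def quantization_error(v: float, patch: int) -> float:
    """Absolute per-axis error, in pixels, introduced by snapping v to its patch centre."""
    return abs(v - snap_to_patch_center(v, patch))


if __name__ == "__main__":
    patch = 16  # assumed patch size in pixels
    worst = max(quantization_error(float(v), patch) for v in range(256))
    print(worst)  # worst case over integer coordinates 0..255
```

Under these assumptions the worst-case error is half a patch per axis, so a method claiming pixel-level control has to explain how it avoids this kind of snapping.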

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. The concerns about missing mechanistic details are valid, and we will revise the manuscript to address them directly.

Point-by-point responses
  1. Referee: [Abstract] The central claim that MetaPoint tokens enable pixel-level control by directly leveraging inherent positional encodings without architectural changes is unsupported; no equations, derivations, or implementation details are supplied to show how an arbitrary continuous coordinate is injected into standard discrete positional encodings (sinusoidal, rotary, or learned) while remaining a single token.

    Authors: We agree the abstract lacks the requested equations and derivation. In the revised manuscript we will add a concise paragraph (with the core mapping equation) to the abstract that shows how a continuous (x, y) pair is converted to a single special token whose embedding is positioned via the model's existing sinusoidal/rotary/learned encoding; the token remains a single vocabulary item and requires no architectural modification or custom masking. revision: yes

  2. Referee: [Abstract] The assumption that the model's positional encoding can natively decode MetaPoint tokens as virtual continuous points on the 2D canvas is load-bearing for the 'no changes' guarantee, yet the text provides no mechanism, ablation, or empirical verification that this mapping preserves precision or works across model families.

    Authors: The current text indeed supplies no explicit mechanism, ablation, or cross-family verification. We will expand the methods section with the precise token-to-position mapping, add an ablation on coordinate precision, and include a short cross-model experiment (e.g., on both sinusoidal and rotary models) demonstrating that the same single-token construction yields pixel-level control without retraining or masking changes. revision: yes

Circularity Check

0 steps flagged

No circularity: no derivation, equations, or fitted quantities present

Full rationale

The paper presents MetaPoint as a lightweight token-based method that reuses existing positional encodings without architectural changes. No equations, predictions, self-citations as load-bearing premises, or fitted parameters appear in the provided text. The claim is a design proposal rather than a reduction of outputs to inputs by construction, so none of the enumerated circularity patterns apply.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5726 in / 990 out tokens · 38619 ms · 2026-06-28T06:04:01.332167+00:00 · methodology


Reference graph

Works this paper leans on

99 extracted references · 17 linked inside Pith

  1. [1]

    Genius: Generative fluid intelligence evaluation suite.arXiv preprint arXiv:2602.11144, 2026

    Ruichuan An, Sihan Yang, Ziyu Guo, Wei Dai, Zijun Shen, Haodong Li, Renrui Zhang, Xinyu Wei, Guopeng Li, Wenshan Wu, et al. Genius: Generative fluid intelligence evaluation suite.arXiv preprint arXiv:2602.11144, 2026. 4

  2. [2]

    Stitch: Training-free position control in multimodal diffusion transformers.arXiv preprint arXiv:2509.26644, 2025

    Jessica Bader, Mateusz Pach, Maria A Bravo, Serge Belongie, and Zeynep Akata. Stitch: Training-free position control in multimodal diffusion transformers.arXiv preprint arXiv:2509.26644, 2025. 4

  3. [3]

    Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer.arXiv preprint arXiv:2505.22705, 2025

    Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, et al. Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer.arXiv preprint arXiv:2505.22705, 2025. 9

  4. [4]

    Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025. 9

  5. [5]

    Lamic: Layout-aware multi-image composition via scalability of multimodal diffusion transformer.arXiv preprint arXiv:2508.00477,

    Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, and Weiming Zhang. Lamic: Layout-aware multi-image composition via scalability of multimodal diffusion transformer.arXiv preprint arXiv:2508.00477,

  6. [6]

    Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 7

  7. [7]

    Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683,

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683,

  8. [8]

    1, 2, 4, 5, 7, 9, 10, 18

  9. [9]

    Muses: 3d-controllable image generation via multi-modal agent collaboration

    Yanbo Ding, Shaobin Zhuang, Kunchang Li, Zhengrong Yue, Yu Qiao, and Yali Wang. Muses: 3d-controllable image generation via multi-modal agent collaboration. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 2753–2761, 2025. 4

  10. [10]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first International Conference on Machine Learning, 2024. 9

  11. [11]

    Got: Unleashing reasoning capability of multimodal large language model for visual generation and editing.arXiv:2503.10639, 2025

    Rongyao Fang, Chengqi Duan, Kun Wang, Linjiang Huang, Hao Li, Shilin Yan, Hao Tian, Xingyu Zeng, Rui Zhao, Jifeng Dai, et al. Got: Unleashing reasoning capability of multimodal large language model for visual generation and editing.arXiv:2503.10639, 2025. 4

  12. [12]

    Layoutgpt: Compositional visual planning and generation with large language models

    Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, and William Yang Wang. Layoutgpt: Compositional visual planning and generation with large language models. Advances in Neural Information Processing Systems, 36:18225–18250, 2023. 1

  13. [13]

    Geneval: An object-focused framework for evaluating text-to-image alignment

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment. InNeurIPS, 2023. 4

  14. [14]

    Imagen 4

    Google. Imagen 4. https://labs.google/fx/tools/image-fx, 2025. 9

  15. [15]

    Nano banana

    Google. Nano banana. https://gemini.google/overview/image-generation/, 2025. 1, 2, 3, 4, 5, 9

  16. [16]

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025. 6, 7, 16

  17. [17]

    Ella: Equip diffusion models with llm for enhanced semantic alignment, 2024

    Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment, 2024. 4

  18. [18]

    T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation.Advances in Neural Information Processing Systems, 36: 78723–78747, 2023

    Kaiyi Huang, Kaiyue Sun, Enze Xie, Zhenguo Li, and Xihui Liu. T2i-compbench: A comprehensive benchmark for open-world compositional text-to-image generation.Advances in Neural Information Processing Systems, 36: 78723–78747, 2023. 4

  19. [19]

    Layerdiff: Exploring text-guided multi-layered composable image synthesis via layer-collaborative diffusion model

    Runhui Huang, Kaixin Cai, Jianhua Han, Xiaodan Liang, Renjing Pei, Guansong Lu, Songcen Xu, Wei Zhang, and Hang Xu. Layerdiff: Exploring text-guided multi-layered composable image synthesis via layer-collaborative diffusion model. InEuropean Conference on Computer Vision, pages 144–160. Springer, 2024. 4

  20. [20]

    T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot.arXiv preprint arXiv:2505.00703, 2025

    Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, and Hongsheng Li. T2i-r1: Reinforcing image generation with collaborative semantic-level and token-level cot.arXiv preprint arXiv:2505.00703, 2025. 4

  21. [21]

    Srum: Fine-grained self-rewarding for unified multimodal models.arXiv preprint arXiv:2510.12784, 2025

    Weiyang Jin, Yuwei Niu, Jiaqi Liao, Chengqi Duan, Aoxue Li, Shenghua Gao, and Xihui Liu. Srum: Fine-grained self-rewarding for unified multimodal models.arXiv preprint arXiv:2510.12784, 2025. 4

  22. [22]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. InICCV, pages 4015–4026, 2023. 7

  23. [23]

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux. 1 kontext: Flow matching for in-context image generation and editing in latent space.arXiv preprint arXiv:2506.15742, 2025. 2, 9, 10, 18

  24. [24]

    Groundit: Grounding diffusion transformers via noisy patch transplantation.Advances in Neural Information Processing Systems, 37:58610–58636, 2024

    Yuseung Lee, Taehoon Yoon, and Minhyuk Sung. Groundit: Grounding diffusion transformers via noisy patch transplantation.Advances in Neural Information Processing Systems, 37:58610–58636, 2024. 3, 8

  25. [25]

    Onecat: Decoder-only auto-regressive model for unified understanding and generation.arXiv preprint arXiv:2509.03498, 2025

    Han Li, Xinyu Peng, Yaoming Wang, Zelin Peng, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Wenrui Dai, and Hongkai Xiong. Onecat: Decoder-only auto-regressive model for unified understanding and generation.arXiv preprint arXiv:2509.03498, 2025. 2 12

  26. [26]

    Coco: Code as cot for text-to-image preview and rare concept generation

    Haodong Li, Chunmei Qing, Huanyu Zhang, Dongzhi Jiang, Yihang Zou, Hongbo Peng, Dingming Li, Yuhong Dai, ZePeng Lin, Juanxi Tian, et al. Coco: Code as cot for text-to-image preview and rare concept generation. arXiv preprint arXiv:2603.08652, 2026. 4

  27. [27]

    Gir-bench: Versatile benchmark for generating images with reasoning.arXiv preprint arXiv:2510.11026, 2025

    Hongxiang Li, Yaowei Li, Bin Lin, Yuwei Niu, Yuhang Yang, Xiaoshuang Huang, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Long Chen. Gir-bench: Versatile benchmark for generating images with reasoning.arXiv preprint arXiv:2510.11026, 2025. 4

  28. [28]

    Easier painting than thinking: Can text-to-image models set the stage, but not direct the play?arXiv preprint arXiv:2509.03516, 2025

    Ouxiang Li, Yuan Wang, Xinting Hu, Huijuan Huang, Rui Chen, Jiarong Ou, Xin Tao, Pengfei Wan, and Fuli Feng. Easier painting than thinking: Can text-to-image models set the stage, but not direct the play?arXiv preprint arXiv:2509.03516, 2025. 3, 7

  29. [29]

    Gligen: Open-set grounded text-to-image generation

    Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22511–22521, 2023. 2, 3, 8, 17

  30. [30]

    Describe anything: Detailed localized image and video captioning.arXiv preprint arXiv:2504.16072, 2025

    Long Lian, Yifan Ding, Yunhao Ge, Sifei Liu, Hanzi Mao, Boyi Li, Marco Pavone, Ming-Yu Liu, Trevor Darrell, Adam Yala, et al. Describe anything: Detailed localized image and video captioning.arXiv preprint arXiv:2504.16072, 2025. 7

  31. [31]

    Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025

    Chao Liao, Liyang Liu, Xun Wang, Zhengxiong Luo, Xinyu Zhang, Wenliang Zhao, Jie Wu, Liang Li, Zhi Tian, and Weilin Huang. Mogao: An omni foundation model for interleaved multi-modal generation.arXiv preprint arXiv:2505.05472, 2025. 2, 5

  32. [32]

    Imagegen-cot: Enhancing text-to-image in-context learning with chain-of-thought reasoning

    Jiaqi Liao, Zhengyuan Yang, Linjie Li, Dianqi Li, Kevin Lin, Yu Cheng, and Lijuan Wang. Imagegen-cot: Enhancing text-to-image in-context learning with chain-of-thought reasoning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17214–17223, 2025. 4

  33. [33]

    Uniworld: High-resolution semantic encoders for unified visual understanding and generation.arXiv preprint arXiv:2506.03147, 2025

    Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation.arXiv preprint arXiv:2506.03147, 2025. 10

  34. [34]

    Step1x-edit: A practical framework for general image editing.arXiv preprint arXiv:2504.17761,

    Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing.arXiv preprint arXiv:2504.17761,

  35. [35]

    Tf-icon: Diffusion-based training-free cross-domain image composition

    Shilin Lu, Yanzhu Liu, and Adams Wai-Kin Kong. Tf-icon: Diffusion-based training-free cross-domain image composition. InICCV, 2023. 4

  36. [36]

    Mace: Mass concept erasure in diffusion models.CVPR, 2024

    Shilin Lu, Zilan Wang, Leyang Li, Yanzhu Liu, and Adams Wai-Kin Kong. Mace: Mass concept erasure in diffusion models.CVPR, 2024

  37. [37]

    Robust watermarking using generative priors against image editing: From benchmarking to advances.arXiv preprint arXiv:2410.18775, 2024

    Shilin Lu, Zihan Zhou, Jiayou Lu, Yuanzhi Zhu, and Adams Wai-Kin Kong. Robust watermarking using generative priors against image editing: From benchmarking to advances.arXiv preprint arXiv:2410.18775, 2024

  38. [38]

    Does flux already know how to perform physically plausible image composition?arXiv preprint arXiv:2509.21278, 2025

    Shilin Lu, Zhuming Lian, Zihan Zhou, Shaocong Zhang, Chen Zhao, and Adams Wai-Kin Kong. Does flux already know how to perform physically plausible image composition?arXiv preprint arXiv:2509.21278, 2025. 4

  39. [39]

    Does understanding inform generation in unified multimodal models? from analysis to path forward

    Yuwei Niu, Weiyang Jin, Jiaqi Liao, Chaoran Feng, Peng Jin, Bin Lin, Zongjian Li, Bin Zhu, Weihao Yu, and Li Yuan. Does understanding inform generation in unified multimodal models? from analysis to path forward. arXiv preprint arXiv:2511.20561, 2025. 4

  40. [40]

    Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265,

    Yuwei Niu, Munan Ning, Mengren Zheng, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Bin Zhu, and Li Yuan. Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265,

  41. [41]

    Gpt-4o.https://openai.com/index/introducing-4o-image-generation/, 2025

    OpenAI. Gpt-4o.https://openai.com/index/introducing-4o-image-generation/, 2025. 1, 2, 3, 5, 9

  42. [42]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InICML, pages 8748–8763, 2021. 4

  43. [43]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 4 13

  44. [44]

    Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.Neurocomputing, 568:127063, 2024. 4, 5

  45. [45]

    Omost github page, 2024

    Omost Team. Omost github page, 2024. 1

  46. [46]

    Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025

    Seedream Team, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimodal image generation.arXiv preprint arXiv:2509.20427, 2025. 1, 2, 4, 5, 9

  47. [47]

    Attention is all you need.NeurIPS, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.NeurIPS, 30, 2017. 4

  48. [48]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 5

  49. [49]

    Promptenhancer: A simple approach to enhance text-to-image models via chain-of-thought prompt rewriting.arXiv preprint arXiv:2509.04545, 2025

    Linqing Wang, Ximing Xing, Yiji Cheng, Zhiyuan Zhao, Donghao Li, Tiankai Hang, Jiale Tao, Qixun Wang, Ruihuang Li, Comi Chen, et al. Promptenhancer: A simple approach to enhance text-to-image models via chain-of-thought prompt rewriting.arXiv preprint arXiv:2509.04545, 2025. 4

  50. [50]

    MS-diffusion: Multi-subject zero-shot image personalization with layout guidance

    Xierui Wang, Siming Fu, Qihan Huang, Wanggui He, and Hao Jiang. MS-diffusion: Multi-subject zero-shot image personalization with layout guidance. InThe Thirteenth International Conference on Learning Representations,

  51. [51]

    URLhttps://openreview.net/forum?id=PJqP0wyQek. 8

  52. [52]

    Instancediffusion: Instance-level control for image generation

    Xudong Wang, Trevor Darrell, Sai Saketh Rambhatla, Rohit Girdhar, and Ishan Misra. Instancediffusion: Instance-level control for image generation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6232–6242, 2024. 2, 3, 8, 17

  53. [53]

    Mint: Multi-modal chain of thought in unified generative models for enhanced image generation.arXiv:2503.01298, 2025

    Yi Wang, Mushui Liu, Wanggui He, Longxiang Zhang, Ziwei Huang, Guanghao Zhang, Fangxun Shu, Zhong Tao, Dong She, Zhelun Yu, et al. Mint: Multi-modal chain of thought in unified generative models for enhanced image generation.arXiv:2503.01298, 2025. 4

  54. [54]

    Genartist: Multimodal llm as an agent for unified image generation and editing.Advances in Neural Information Processing Systems, 37:128374–128395, 2024

    Zhenyu Wang, Aoxue Li, Zhenguo Li, and Xihui Liu. Genartist: Multimodal llm as an agent for unified image generation and editing.Advances in Neural Information Processing Systems, 37:128374–128395, 2024. 4

  55. [55]

    Divide and conquer: Language models can plan and self-correct for compositional text-to-image generation.arXiv preprint arXiv:2401.15688,

    Zhenyu Wang, Enze Xie, Aoxue Li, Zhongdao Wang, Xihui Liu, and Zhenguo Li. Divide and conquer: Language models can plan and self-correct for compositional text-to-image generation.arXiv preprint arXiv:2401.15688,

  56. [56]

    Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025. 9, 18

  57. [57]

    Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 2025

    Chenyuan Wu, Pengfei Zheng, Ruiran Yan, Shitao Xiao, Xin Luo, Yueze Wang, Wanli Li, Xiyan Jiang, Yexin Liu, Junjie Zhou, Ze Liu, Ziyi Xia, Chaofan Li, Haoge Deng, Jiahao Wang, Kun Luo, Bo Zhang, Defu Lian, Xinlong Wang, Zhongyuan Wang, Tiejun Huang, and Zheng Liu. Omnigen2: Exploration to advanced multimodal generation.arXiv preprint arXiv:2506.18871, 202...

  58. [58]

    Self-correcting llm-controlled diffusion models

    Tsung-Han Wu, Long Lian, Joseph E Gonzalez, Boyi Li, and Trevor Darrell. Self-correcting llm-controlled diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6327–6336, 2024. 4

  59. [59]

    Reconstruction alignment improves unified multimodal models.arXiv preprint arXiv:2509.07295, 2025

    Ji Xie, Trevor Darrell, Luke Zettlemoyer, and XuDong Wang. Reconstruction alignment improves unified multimodal models.arXiv preprint arXiv:2509.07295, 2025. 4

  60. [60]

    Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms

    Ling Yang, Zhaochen Yu, Chenlin Meng, Minkai Xu, Stefano Ermon, and Bin Cui. Mastering text-to-image diffusion: Recaptioning, planning, and generating with multimodal llms. InInternational Conference on Machine Learning, 2024. 4

  61. [61]

    Reco: Region-controlled text-to-image generation

    Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, et al. Reco: Region-controlled text-to-image generation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14246–14255, 2023. 2, 3, 5, 8

  62. [62]

    Imgedit: A unified image editing dataset and benchmark.arXiv preprint arXiv:2505.20275, 2025

    Yang Ye, Xianyi He, Zongjian Li, Bin Lin, Shenghai Yuan, Zhiyuan Yan, Bohan Hou, and Li Yuan. Imgedit: A unified image editing dataset and benchmark.arXiv preprint arXiv:2505.20275, 2025. 3, 7

  63. [63]

    Eligen: Entity-level controlled image generation with regional attention.arXiv preprint arXiv:2501.01097, 2025

    Hong Zhang, Zhongjie Duan, Xingjun Wang, Yingda Chen, and Yu Zhang. Eligen: Entity-level controlled image generation with regional attention.arXiv preprint arXiv:2501.01097, 2025. 1, 5, 8, 17 14

  64. [64]

    Creatilayout: Siamese multimodal diffusion transformer for creative layout-to-image generation.arXiv preprint arXiv:2412.03859, 2024

    Hui Zhang, Dexiang Hong, Tingwei Gao, Yitong Wang, Jie Shao, Xinglong Wu, Zuxuan Wu, and Yu-Gang Jiang. Creatilayout: Siamese multimodal diffusion transformer for creative layout-to-image generation.arXiv preprint arXiv:2412.03859, 2024. 2, 3, 5

  65. [65]

    Creatilayout: Siamese multimodal diffusion transformer for creative layout-to-image generation

    Hui Zhang, Dexiang Hong, Yitong Wang, Jie Shao, Xinglong Wu, Zuxuan Wu, and Yu-Gang Jiang. Creatilayout: Siamese multimodal diffusion transformer for creative layout-to-image generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 18487–18497, 2025. 8

  66. [66]

    Itercomp: Iterative composition-aware feedback learning from model gallery for text-to-image generation.arXiv preprint arXiv:2410.07171, 2024

    Xinchen Zhang, Ling Yang, Guohao Li, Yaqi Cai, Jiake Xie, Yong Tang, Yujiu Yang, Mengdi Wang, and Bin Cui. Itercomp: Iterative composition-aware feedback learning from model gallery for text-to-image generation.arXiv preprint arXiv:2410.07171, 2024. 4

  67. [67]

    Layercraft: Enhancing text-to-image generation with cot reasoning and layered object integration.arXiv preprint arXiv:2504.00010, 2025

    Yuyao Zhang, Jinghao Li, and Yu-Wing Tai. Layercraft: Enhancing text-to-image generation with cot reasoning and layered object integration.arXiv preprint arXiv:2504.00010, 2025. 4

  68. [68]

    Enabling instructional image editing with in-context generation in large scale diffusion transformer

    Zechuan Zhang, Ji Xie, Yu Lu, Zongxin Yang, and Yi Yang. Enabling instructional image editing with in-context generation in large scale diffusion transformer. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.

  69. [69]

    Wavelet-based fourier information interaction with frequency diffusion adjustment for underwater image restoration

    Chen Zhao, Weiling Cai, Chenyu Dong, and Chengwei Hu. Wavelet-based fourier information interaction with frequency diffusion adjustment for underwater image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8281–8291, 2024.

  70. [70]

    From zero to detail: Deconstructing ultra-high-definition image restoration from progressive spectral perspective

    Chen Zhao, Zhizhou Chen, Yunzhe Xu, Enxuan Gu, Jian Li, Zili Yi, Qian Wang, Jian Yang, and Ying Tai. From zero to detail: Deconstructing ultra-high-definition image restoration from progressive spectral perspective. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17935–17946, 2025.

  71. [71]

    Luve: Latent-cascaded ultra-high-resolution video generation with dual frequency experts. arXiv preprint arXiv:2602.11564, 2026

    Chen Zhao, Jiawei Chen, Hongyu Li, Zhuoliang Kang, Shilin Lu, Xiaoming Wei, Kai Zhang, Jian Yang, and Ying Tai. Luve: Latent-cascaded ultra-high-resolution video generation with dual frequency experts. arXiv preprint arXiv:2602.11564, 2026.

  72. [72]

    Ultrahr-100k: Enhancing uhr image synthesis with a large-scale high-quality dataset. Advances in Neural Information Processing Systems, 38:3373–3393, 2026

    Chen Zhao, En Ci, Yunzhe Xu, Tiehan Fan, Shanyan Guan, Yanhao Ge, Jian Yang, and Ying Tai. Ultrahr-100k: Enhancing uhr image synthesis with a large-scale high-quality dataset. Advances in Neural Information Processing Systems, 38:3373–3393, 2026.

  73. [73]

    From zero to detail: A progressive spectral decoupling paradigm for uhd image restoration with new benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

    Chen Zhao, Yunzhe Xu, Zhizhou Chen, Enxuan Gu, Kai Zhang, Xiaoming Liu, Jian Yang, and Ying Tai. From zero to detail: A progressive spectral decoupling paradigm for uhd image restoration with new benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026.

  74. [74]

    Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems, 37:3058–3093, 2024

    Haozhe Zhao, Xiaojian Shawn Ma, Liang Chen, Shuzheng Si, Rujie Wu, Kaikai An, Peiyu Yu, Minjia Zhang, Qing Li, and Baobao Chang. Ultraedit: Instruction-based fine-grained image editing at scale. Advances in Neural Information Processing Systems, 37:3058–3093, 2024.

  75. [75]

    Migc: Multi-instance generation controller for text-to-image synthesis

    Dewei Zhou, You Li, Fan Ma, Xiaoting Zhang, and Yi Yang. Migc: Multi-instance generation controller for text-to-image synthesis. In CVPR, 2024.

  76. [76]

    3dis: Depth-driven decoupled instance synthesis for text-to-image generation. arXiv preprint arXiv:2410.12669, 2024

    Dewei Zhou, Ji Xie, Zongxin Yang, and Yi Yang. 3dis: Depth-driven decoupled instance synthesis for text-to-image generation. arXiv preprint arXiv:2410.12669, 2024.

  77. [77]

    Bidedpo: Conditional image generation with simultaneous text and condition alignment. arXiv preprint arXiv:2511.19268,

    Dewei Zhou, Mingwei Li, Zongxin Yang, Yu Lu, Yunqiu Xu, Zhizhong Wang, Zeyi Huang, and Yi Yang. Bidedpo: Conditional image generation with simultaneous text and condition alignment. arXiv preprint arXiv:2511.19268,

  78. [78]

    Dreamrenderer: Taming multi-instance attribute control in large-scale text-to-image models

    Dewei Zhou, Mingwei Li, Zongxin Yang, and Yi Yang. Dreamrenderer: Taming multi-instance attribute control in large-scale text-to-image models. In ICCV, 2025.

  79. [79]

    3dis-flux: simple and efficient multi-instance generation with dit rendering. arXiv preprint arXiv:2501.05131, 2025

    Dewei Zhou, Ji Xie, Zongxin Yang, and Yi Yang. 3dis-flux: simple and efficient multi-instance generation with dit rendering. arXiv preprint arXiv:2501.05131, 2025.

  80. [80]

    Refineanything: Multimodal region-specific refinement for perfect local details. arXiv preprint arXiv:2604.06870, 2026

    Dewei Zhou, You Li, Zongxin Yang, and Yi Yang. Refineanything: Multimodal region-specific refinement for perfect local details. arXiv preprint arXiv:2604.06870, 2026.
