pith. sign in

arxiv: 2606.01803 · v1 · pith:45EKVN4Dnew · submitted 2026-06-01 · 💻 cs.AI

OctoT2I: A Self-Evolving Agentic Text-to-Image Router

Pith reviewed 2026-06-28 14:20 UTC · model grok-4.3

classification 💻 cs.AI
keywords text-to-image routingagentic frameworkself-evolving mechanismPSEL loopinference efficiencyGenEval benchmarkmulti-model selectionconceptual dimensions
0
0 comments X

The pith

OctoT2I builds a knowledge base from scratch so a stateful router can pick the right text-to-image tool for each prompt without human rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OctoT2I as a way to treat text-to-image generation as joint optimization of output quality and inference cost. It replaces handcrafted priors and single-path decisions with a self-evolving mechanism that first names conceptual dimensions such as style and count, then uses an iterative PSEL loop to map each tool's strengths. The resulting memory enables multi-round adaptive routing that selects models on the fly. Experiments report a GenEval score of 0.96 together with large measured reductions in time and energy relative to the prior leading method. A sympathetic reader would see this as evidence that unsupervised exploration of tool frontiers can deliver both accuracy and efficiency at once.

Core claim

OctoT2I reformulates the T2I task as a joint optimization of generation quality and inference efficiency. It implements a stateful, multi-round routing strategy that adaptively selects the most suitable tool based on its knowledge and memory. This strategy is enabled by a knowledge base built from scratch by the Self-Evolving Mechanism, which autonomously defines foundational Conceptual Dimensions and explores their combinations via the iterative PSEL loop to discover each tool's capability frontier without external guidance.

What carries the argument

The Self-Evolving Mechanism that uses the PSEL (Propose-Solve-Evaluate-Learn) loop to define Conceptual Dimensions and map tool capability frontiers autonomously.

If this is right

  • The router reaches a GenEval score of 0.96.
  • Inference runs 90.3 percent faster than the Flow-GRPO baseline.
  • Energy use drops by 56.6 percent relative to the same baseline.
  • Continuous PSEL iteration produces ongoing improvement in routing decisions without added human input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same unsupervised mapping process could be applied to select among video or audio generators once their capability frontiers are defined.
  • Production systems could keep the PSEL loop running online to adapt routing as new model versions appear.
  • If the discovered dimensions prove stable, manual benchmark suites for T2I tools might be replaced by automated frontier maps.
  • The stateful memory could support prompt refinement across rounds rather than one-shot selection.

Load-bearing premise

The self-evolving mechanism can autonomously define Conceptual Dimensions and explore their combinations via the PSEL loop to accurately discover each tool's capability frontier with no human supervision or external guidance.

What would settle it

Run the PSEL loop on the same set of T2I tools, then measure whether the resulting routing decisions match or exceed the performance of routes chosen by human experts on a fixed test prompt set; a large gap would falsify the autonomy claim.

Figures

Figures reproduced from arXiv: 2606.01803 by Bin Chen, Gehui Li, Jian Zhang, Ronggang Wang, Xu Jiang, Yule Duan.

Figure 1
Figure 1. Figure 1: The core advantages of OctoT2I in performance, decision-making, and knowledge acquisition. (a) OctoT2I (Ours) achieves an exceptional performance-efficiency balance on GenEval. (b) This superior performance stems from its intelligent, evidence-based decisions, which route each user prompt to the most suitable T2I model. For example, it selects the most efficient tool (sd-turbo) for “a photo of a snowboard”… view at source ↗
Figure 2
Figure 2. Figure 2: The overall architecture of OctoT2I. (Top) Inference Workflow. The agent takes a user prompt and executes a “reason-act￾reflect” loop: The decision policy leverages its memory and knowledge modules to select an appropriate tool. This tool is then used to generate an image, which is subsequently evaluated to obtain a quality score. The workflow terminates when the score meets the quality threshold or the ro… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative comparison with agentic T2I methods [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Ablation study of θ on performance (green, left axis) and inference time (purple, right axis) on the GenEval benchmark. spatial layout. In contrast, Idea2Img, GenArtist, and Chat￾Gen all fail to adhere to the spatial instruction. Furthermore, when faced with prompts requiring the combination of mul￾tiple unique objects (e.g., “The conical hat and the spheri￾cal ball...”), OctoT2I correctly generates aesthe… view at source ↗
Figure 5
Figure 5. Figure 5: Visual comparison on real-world user prompts [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

The explosive growth of Text-to-Image (T2I) models, from large-scale versions to lightweight, real-time ones, now faces diminishing marginal returns from single-model scaling. Agentic T2I methods emerged to alleviate this bottleneck by using multiple models. However, existing agentic T2I methods suffer from three key challenges: reliance on expensive handcrafted priors or human annotations, rigid single-path decision mechanisms, and a neglect of inference efficiency. To address these challenges, we introduce OctoT2I, a novel agentic framework that reformulates the T2I task as a joint optimization of generation quality and inference efficiency. OctoT2I implements a stateful, multi-round routing strategy that adaptively selects the most suitable tool based on its knowledge and memory. This strategy is enabled by a knowledge base built from scratch by our novel Self-Evolving Mechanism. This mechanism, which requires no human supervision, first autonomously defines foundational Conceptual Dimensions (eg, style, color, count) and then intelligently explores their combinations via an iterative" Propose--Solve--Evaluate--Learn"(PSEL) loop. The PSEL loop efficiently discovers each tool's capability frontier, driving continuous improvement without external guidance. Extensive experiments demonstrate that OctoT2I achieves competitive performance (0.96) on GenEval while delivering a 90.3% inference speedup and a 56.6% energy-efficiency gain over the leading baseline (Flow-GRPO), striking an exceptional balance between performance and efficiency. Code and models will be made available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents OctoT2I, an agentic text-to-image routing framework that reformulates T2I as joint optimization of quality and efficiency. It introduces a self-evolving mechanism that autonomously defines Conceptual Dimensions (e.g., style, color, count) and iteratively explores tool capability frontiers via a stateful PSEL (Propose–Solve–Evaluate–Learn) loop with no human supervision, building a knowledge base from scratch to enable adaptive multi-round routing. The paper reports that this yields a GenEval score of 0.96 together with a 90.3% inference speedup and 56.6% energy-efficiency gain relative to the Flow-GRPO baseline.

Significance. If the self-evolving PSEL mechanism can be shown to operate without implicit priors or external guidance, the work would constitute a meaningful advance in agentic T2I systems by removing reliance on handcrafted priors and rigid single-path routing while simultaneously improving inference efficiency. The reported combination of competitive quality and large efficiency gains would be notable for practical deployment of multi-model T2I pipelines.

major comments (2)
  1. [Abstract] Abstract: the central performance claims (0.96 GenEval, 90.3% speedup, 56.6% energy gain) rest on the assertion that the PSEL loop autonomously defines Conceptual Dimensions and accurately discovers each tool’s capability frontier from scratch. No technical specification is given for the initial state before the first Propose step, the precise evaluation criteria inside Evaluate, or any safeguards against drift or incomplete coverage; without these the reported numbers cannot be verified as arising from the claimed mechanism rather than from base-LLM priors.
  2. [Abstract] Abstract: the claim that the knowledge base is “built from scratch” with “no human supervision” is load-bearing for the generalization of the routing decisions. The manuscript provides no description of how the loop initializes, whether seeded examples or LLM-internal knowledge are used, or how the iterative process guarantees exhaustive frontier discovery; this directly affects whether the efficiency numbers are reproducible or generalizable.
minor comments (1)
  1. The abstract states that code and models will be made available but supplies no repository link, license, or expected release date, which would facilitate verification of the reported results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The two major comments correctly identify that the abstract (and, by extension, the current manuscript) lacks sufficient technical detail on the PSEL loop's initialization, evaluation criteria, and safeguards. We will revise the manuscript to supply these specifications.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central performance claims (0.96 GenEval, 90.3% speedup, 56.6% energy gain) rest on the assertion that the PSEL loop autonomously defines Conceptual Dimensions and accurately discovers each tool’s capability frontier from scratch. No technical specification is given for the initial state before the first Propose step, the precise evaluation criteria inside Evaluate, or any safeguards against drift or incomplete coverage; without these the reported numbers cannot be verified as arising from the claimed mechanism rather than from base-LLM priors.

    Authors: We agree that the current manuscript does not provide these specifications at the level of detail required for verification. In the revised version we will add an explicit subsection (new Section 3.2.1) that states: (1) the loop begins with an empty knowledge base and no seeded examples or task-specific priors; (2) the Evaluate step uses only automated, reference-free metrics (GenEval-style scoring plus measured latency/energy) with no human judgment; and (3) drift and coverage safeguards consist of a termination condition that halts when three consecutive Propose steps yield no new frontier dimensions and a consistency check that re-evaluates a random sample of prior frontiers. These additions will allow readers to confirm that the reported numbers arise from the described mechanism. revision: yes

  2. Referee: [Abstract] Abstract: the claim that the knowledge base is “built from scratch” with “no human supervision” is load-bearing for the generalization of the routing decisions. The manuscript provides no description of how the loop initializes, whether seeded examples or LLM-internal knowledge are used, or how the iterative process guarantees exhaustive frontier discovery; this directly affects whether the efficiency numbers are reproducible or generalizable.

    Authors: We concur that the manuscript currently omits these initialization and coverage details. The revision will include a precise account of the zero-shot start (empty memory, no human-provided seeds or annotations), clarification that the base LLM supplies only general reasoning capacity rather than pre-loaded task knowledge, and the exhaustive-discovery guarantee (iterative expansion continues until the Propose step produces no novel Conceptual Dimensions for a fixed number of rounds, with an explicit coverage metric reported in the experiments). These changes will directly support the reproducibility and generalizability claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents an empirical agentic framework whose performance claims (GenEval score, speedup, energy gains) rest on external benchmarks and experimental comparisons rather than any internal derivation. The PSEL loop and autonomous dimension definition are described as building a knowledge base from scratch, but the abstract and provided text contain no equations, fitted parameters renamed as predictions, or self-referential reductions that would make the reported results equivalent to the loop's own outputs by construction. No self-citation load-bearing steps or ansatz smuggling are identifiable from the given material. The derivation is therefore self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the domain assumption that distinct capability frontiers exist and can be discovered autonomously; no free parameters or invented entities with independent evidence are stated in the abstract.

axioms (1)
  • domain assumption T2I models possess distinct and discoverable capability frontiers across prompt dimensions such as style, color, and count.
    This premise enables the PSEL loop to improve routing without external labels.
invented entities (2)
  • Conceptual Dimensions no independent evidence
    purpose: Autonomously defined categories used to explore model capabilities.
    Introduced by the self-evolving mechanism; no independent evidence supplied.
  • PSEL loop no independent evidence
    purpose: Iterative propose-solve-evaluate-learn process that builds the knowledge base.
    Core mechanism of the self-evolving framework; no independent evidence supplied.

pith-pipeline@v0.9.1-grok · 5825 in / 1266 out tokens · 24817 ms · 2026-06-28T14:20:38.603033+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

64 extracted references · 40 canonical work pages · 20 internal anchors

  1. [1]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023. 3

  2. [2]

    Improving image generation with better captions.Computer Science

    James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions.Computer Science. https://cdn. openai. com/papers/dall-e-3. pdf, 2(3):8, 2023. 6

  3. [3]

    Adversarial diffusion compression for real-world image super-resolution

    Bin Chen, Gehui Li, Rongyuan Wu, Xindong Zhang, Jie Chen, Jian Zhang, and Lei Zhang. Adversarial diffusion compression for real-world image super-resolution. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 28208–28220, 2025. 2

  4. [4]

    Improved adversarial diffusion compression for real-world video super-resolution.arXiv preprint arXiv:2603.00458, 2026

    Bin Chen, Weiqi Li, Shijie Zhao, Xuanyu Zhang, Junlin Li, Li Zhang, and Jian Zhang. Improved adversarial diffusion compression for real-world video super-resolution.arXiv preprint arXiv:2603.00458, 2026. 2

  5. [5]

    T2i-copilot: A training-free multi-agent text-to-image sys- tem for enhanced prompt interpretation and interactive gen- eration

    Chieh-Yun Chen, Min Shi, Gong Zhang, and Humphrey Shi. T2i-copilot: A training-free multi-agent text-to-image sys- tem for enhanced prompt interpretation and interactive gen- eration. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 19396–19405, 2025. 2, 3

  6. [6]

    Autoagents: A framework for automatic agent generation.arXiv preprint arXiv:2309.17288, 2023

    Guangyao Chen, Siwei Dong, Yu Shu, Ge Zhang, Jaward Sesay, B¨orje F Karlsson, Jie Fu, and Yemin Shi. Autoagents: A framework for automatic agent generation.arXiv preprint arXiv:2309.17288, 2023. 2, 3

  7. [7]

    Pixart-α: Fast training of diffusion trans-former for photorealistic text-to-image syn- thesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of diffusion trans-former for photorealistic text-to-image syn- thesis. 2

  8. [8]

    Sana-sprint: One-step diffusion with continuous-time con- sistency distillation.arXiv preprint arXiv:2503.09641, 2025

    Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Song Han, and Enze Xie. Sana-sprint: One-step diffusion with continuous-time con- sistency distillation.arXiv preprint arXiv:2503.09641, 2025. 2, 3, 6

  9. [9]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus- pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811,

  10. [10]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025. 6

  11. [11]

    Scaling recti- fied flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas M ¨uller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling recti- fied flow transformers for high-resolution image synthesis. InForty-first international conference on machine learning,

  12. [12]

    Geneval: An object-focused framework for evaluating text- to-image alignment.Advances in Neural Information Pro- cessing Systems, 36:52132–52152, 2023

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text- to-image alignment.Advances in Neural Information Pro- cessing Systems, 36:52132–52152, 2023. 5, 6

  13. [13]

    Chatllm network: More brains, more intel- ligence.arXiv preprint arXiv:2304.12998, 2023

    Rui Hao, Linmei Hu, Weijian Qi, Qingliu Wu, Yirui Zhang, and Liqiang Nie. Chatllm network: More brains, more intel- ligence.arXiv preprint arXiv:2304.12998, 2023

  14. [14]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta pro- gramming for multi-agent collaborative framework.arXiv preprint arXiv:2308.00352, 2023. 2

  15. [15]

    Talkphoto: A versatile training-free conversa- tional assistant for intelligent image editing.arXiv preprint arXiv:2601.01915, 2026

    Yujie Hu, Zecheng Tang, Xu Jiang, Weiqi Li, and Jian Zhang. Talkphoto: A versatile training-free conversa- tional assistant for intelligent image editing.arXiv preprint arXiv:2601.01915, 2026. 3

  16. [16]

    T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to- Image Generation .IEEE Transactions on Pattern Analysis Machine Intelligence, (01):1–17, 5555

    Kaiyi Huang, Chengqi Duan, Kaiyue Sun, Enze Xie, Zhen- guo Li, and Xihui Liu. T2I-CompBench++: An Enhanced and Comprehensive Benchmark for Compositional Text-to- Image Generation .IEEE Transactions on Pattern Analysis Machine Intelligence, (01):1–17, 5555. 5, 6

  17. [17]

    UniShield: An Adaptive Multi-Agent Framework for Unified Forgery Image Detection and Localization

    Qing Huang, Zhipei Xu, Xuanyu Zhang, and Jian Zhang. Unishield: An adaptive multi-agent framework for unified forgery image detection and localization.arXiv preprint arXiv:2510.03161, 2025. 3

  18. [18]

    Open-endedness is essen- tial for artificial superhuman intelligence.arXiv preprint arXiv:2406.04268, 2024

    Edward Hughes, Michael Dennis, Jack Parker-Holder, Feryal Behbahani, Aditi Mavalankar, Yuge Shi, Tom Schaul, and Tim Rocktaschel. Open-endedness is essen- tial for artificial superhuman intelligence.arXiv preprint arXiv:2406.04268, 2024. 2

  19. [19]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 6

  20. [20]

    Chatgen: Automatic text- to-image generation from freestyle chatting.arXiv preprint arXiv:2411.17176, 2024

    Chengyou Jia, Changliang Xia, Zhuohang Dang, Weijia Wu, Hangwei Qian, and Minnan Luo. Chatgen: Automatic text- to-image generation from freestyle chatting.arXiv preprint arXiv:2411.17176, 2024. 2, 3, 6, 8

  21. [21]

    Multi-agent image restoration.arXiv preprint arXiv:2503.09403, 2025

    Xu Jiang, Gehui Li, Bin Chen, and Jian Zhang. Multi-agent image restoration.arXiv preprint arXiv:2503.09403, 2025. 3

  22. [22]

    DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment

    Hongbo Jin, Rongpeng Zhu, Zhongjing Du, Xu Jiang, Jingqi Tian, Qiaoman Zhang, and Jiayu Ding. Dgpo: Distribu- tion guided policy optimization for fine grained credit assign- ment.arXiv preprint arXiv:2605.03327, 2026. 3

  23. [23]

    An llm compiler for parallel function calling.arXiv preprint arXiv:2312.04511, 2023

    Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. An llm compiler for parallel function calling.arXiv preprint arXiv:2312.04511, 2023

  24. [24]

    Flux, 2024

    Black Forest Labs. Flux, 2024. 2, 3, 6

  25. [25]

    Camel: Communicative agents for” mind” exploration of large language model society

    Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for” mind” exploration of large language model society. Advances in Neural Information Processing Systems, 36: 51991–52008, 2023. 2

  26. [26]

    SDXL-Lightning: Progressive Adversarial Diffusion Distillation

    Shanchuan Lin, Anran Wang, and Xiao Yang. Sdxl- lightning: Progressive adversarial diffusion distillation. arXiv preprint arXiv:2402.13929, 2024. 2, 3, 6

  27. [27]

    Evaluating text-to-visual generation with image-to-text gen- eration.arXiv preprint arXiv:2404.01291, 2024

    Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text gen- eration.arXiv preprint arXiv:2404.01291, 2024. 4

  28. [28]

    Playground v3: Im- proving text-to-image alignment with deep-fusion large lan- guage models.arXiv preprint arXiv:2409.10695, 2024

    Bingchen Liu, Ehsan Akhgari, Alexander Visheratin, Aleks Kamko, Linmiao Xu, Shivam Shrirao, Chase Lambert, Joao Souza, Suhail Doshi, and Daiqing Li. Playground v3: Im- proving text-to-image alignment with deep-fusion large lan- guage models.arXiv preprint arXiv:2409.10695, 2024. 2, 3

  29. [29]

    Flow-GRPO: Training Flow Matching Models via Online RL

    Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, and Wanli Ouyang. Flow-grpo: Training flow matching models via on- line rl.arXiv preprint arXiv:2505.05470, 2025. 2, 3, 6

  30. [30]

    Nvila: Efficient frontier visual language models, 2024

    Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yux- ian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vish- wesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang, Pavlo Molchanov, Jan Kautz, Hongxu Yin, Song Han, and Yao Lu. Nvila:...

  31. [31]

    Cot-lized diffusion: Let’s reinforce t2i generation step-by-step.arXiv preprint arXiv:2507.04451, 2025

    Zheyuan Liu, Munan Ning, Qihui Zhang, Shuo Yang, Zhongrui Wang, Yiwei Yang, Xianzhe Xu, Yibing Song, Weihua Chen, Fan Wang, et al. Cot-lized diffusion: Let’s reinforce t2i generation step-by-step.arXiv preprint arXiv:2507.04451, 2025

  32. [32]

    Janusflow: Harmonizing autore- gression and rectified flow for unified multimodal under- standing and generation

    Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. Janusflow: Harmonizing autore- gression and rectified flow for unified multimodal under- standing and generation. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 7739–7751,

  33. [33]

    WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation

    Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Kunpeng Ning, Chaoran Feng, Bin Zhu, and Li Yuan. Wise: A world knowledge-informed semantic evaluation for text-to-image generation.arXiv preprint arXiv:2503.07265, 2025. 6

  34. [34]

    GPT-4 Technical Report

    R OpenAI. Gpt-4 technical report. arxiv 2303.08774.View in Article, 2(5):1, 2023. 3, 6

  35. [35]

    Generative agents: Interactive simulacra of human behavior

    Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Mered- ith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023

  36. [36]

    SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas M ¨uller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion mod- els for high-resolution image synthesis.arXiv preprint arXiv:2307.01952, 2023. 2, 6

  37. [37]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis.arXiv preprint arXiv:2307.16789,

  38. [38]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 2, 3, 6

  39. [39]

    Adversarial diffusion distillation

    Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. InEuropean Conference on Computer Vision, pages 87–103. Springer,

  40. [40]

    Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36, 2024

    Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face.Advances in Neural Information Processing Systems, 36, 2024. 3

  41. [41]

    Realosr: Latent guidance boosts diffusion-based real-world omnidirectional image super-resolutions.arXiv preprint arXiv:2412.09646, 2024

    Xuhan Sheng, Runyi Li, Bin Chen, Weiqi Li, Xu Jiang, and Jian Zhang. Realosr: Latent guidance boosts diffusion-based real-world omnidirectional image super-resolutions.arXiv preprint arXiv:2412.09646, 2024. 2

  42. [42]

    Energy and policy considerations for modern deep learning research

    Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for modern deep learning research. InProceedings of the AAAI conference on artificial intelligence, pages 13693–13696, 2020. 6

  43. [43]

    Sequence to sequence learning with neural networks.Advances in neural information processing systems, 27, 2014

    Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to sequence learning with neural networks.Advances in neural information processing systems, 27, 2014. 2

  44. [44]

    Position: Will we run out of data? limits of llm scaling based on human- generated data

    Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Be- siroglu, Lennart Heim, and Marius Hobbhahn. Position: Will we run out of data? limits of llm scaling based on human- generated data. InForty-first International Conference on Machine Learning, 2024. 2

  45. [45]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 6

  46. [46]

    Emu3: Next-Token Prediction is All You Need

    Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024. 2

  47. [47]

    Unleashing the emergent cognitive synergy in large language models: A task-solving agent through multi-persona self-collaboration

    Z Wang, S Mao, W Wu, T Ge, F Wei, and H Ji. Unleashing cognitive synergy in large language models: A task-solving agent through multi-persona selfcollaboration. arxiv 2023. arXiv preprint arXiv:2307.05300, 2023

  48. [48]

    Genartist: Multimodal llm as an agent for unified image gen- eration and editing.Advances in Neural Information Pro- cessing Systems, 37:128374–128395, 2024

    Zhenyu Wang, Aoxue Li, Zhenguo Li, and Xihui Liu. Genartist: Multimodal llm as an agent for unified image gen- eration and editing.Advances in Neural Information Pro- cessing Systems, 37:128374–128395, 2024. 2, 3, 6, 7, 8

  49. [49]

    Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau

    Zijie J. Wang, Evan Montoya, David Munechika, Haoyang Yang, Benjamin Hoover, and Duen Horng Chau. Diffu- sionDB: A large-scale prompt gallery dataset for text-to- image generative models.arXiv:2210.14896 [cs], 2022. 6

  50. [50]

    Tiif-bench: How does your t2i model follow your instructions?, 2025

    Xinyu Wei, Jinrui Zhang, Zeqing Wang, Hongyang Wei, Zhen Guo, and Lei Zhang. Tiif-bench: How does your t2i model follow your instructions?, 2025. 5

  51. [51]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models.arXiv preprint arXiv:2303.04671, 2023

  52. [52]

    Qwen-Image Technical Report

    Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng-ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, et al. Qwen-image technical report.arXiv preprint arXiv:2508.02324, 2025. 2, 3

  53. [53]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm ap- plications via multi-agent conversation framework.arXiv preprint arXiv:2308.08155, 2023. 3

  54. [54]

    Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer

    Enze Xie, Junsong Chen, Yuyang Zhao, Jincheng Yu, Ligeng Zhu, Chengyue Wu, Yujun Lin, Zhekai Zhang, Muyang Li, Junyu Chen, et al. Sana 1.5: Efficient scaling of training-time and inference-time compute in linear diffusion transformer. arXiv preprint arXiv:2501.18427, 2025. 2, 3, 5, 6

  55. [55]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024. 2, 6

  56. [56]

    Auto-gpt for on- line decision making: Benchmarks and additional opinions

    Hui Yang, Sifu Yue, and Yunzhong He. Auto-gpt for on- line decision making: Benchmarks and additional opinions. arXiv preprint arXiv:2306.02224, 2023

  57. [57]

    Gpt4tools: Teaching large language model to use tools via self-instruction.Advances in Neural Information Processing Systems, 36, 2024

    Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. Gpt4tools: Teaching large language model to use tools via self-instruction.Advances in Neural Information Processing Systems, 36, 2024. 2, 3

  58. [58]

    Difflle: Diffusion-based do- main calibration for weak supervised low-light image en- hancement.International Journal of Computer Vision, 133 (5):2527–2546, 2025

    Shuzhou Yang, Xuanyu Zhang, Yinhuai Wang, Jiwen Yu, Yuhan Wang, and Jian Zhang. Difflle: Diffusion-based do- main calibration for weak supervised low-light image en- hancement.International Journal of Computer Vision, 133 (5):2527–2546, 2025

  59. [59]

    Gencompositor: Generative video compositing with diffusion transformer

    Shuzhou Yang, Xiaoyu Li, Xiaodong Cun, Guangzhi Wang, Lingen Li, Ying Shan, and Jian Zhang. Gencompositor: Generative video compositing with diffusion transformer. In The Fourteenth International Conference on Learning Rep- resentations, 2026. 2

  60. [60]

    Idea2img: Iterative self-refinement with gpt-4v(ision) for automatic im- age design and generation.arXiv preprint arXiv:2310.08541,

    Zhengyuan Yang, Jianfeng Wang, Linjie Li, Kevin Lin, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. Idea2img: Iterative self-refinement with gpt-4v(ision) for automatic im- age design and generation.arXiv preprint arXiv:2310.08541,

  61. [61]

    Alignedgen: Aligning style across generated images.arXiv preprint arXiv:2509.17088, 2025

    Jiexuan Zhang, Yiheng Du, Qian Wang, Weiqi Li, Yu Gu, and Jian Zhang. Alignedgen: Aligning style across generated images.arXiv preprint arXiv:2509.17088, 2025. 2

  62. [62]

    Absolute Zero: Reinforced Self-play Reasoning with Zero Data

    Andrew Zhao, Yiran Wu, Yang Yue, Tong Wu, Quentin Xu, Matthieu Lin, Shenzhi Wang, Qingyun Wu, Zilong Zheng, and Gao Huang. Absolute zero: Reinforced self-play reason- ing with zero data.arXiv preprint arXiv:2505.03335, 2025. 2

  63. [63]

    Diffagent: Fast and accurate text-to-image api selection with large lan- guage model

    Lirui Zhao, Yue Yang, Kaipeng Zhang, Wenqi Shao, Yuxin Zhang, Yu Qiao, Ping Luo, and Rongrong Ji. Diffagent: Fast and accurate text-to-image api selection with large lan- guage model. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6390– 6399, 2024. 2, 3

  64. [64]

    Mindstorms in natural language- based societies of mind.arXiv preprint arXiv:2305.17066, 2023

    Mingchen Zhuge, Haozhe Liu, Francesco Faccio, Dylan R Ashley, R ´obert Csord ´as, Anand Gopalakrishnan, Abdullah Hamdi, Hasan Abed Al Kader Hammoud, Vincent Her- rmann, Kazuki Irie, et al. Mindstorms in natural language- based societies of mind.arXiv preprint arXiv:2305.17066, 2023