pith. machine review for the scientific record.

arxiv: 2512.21788 · v3 · submitted 2025-12-25 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

InstructMoLE: Instruction-Guided Mixture of Low-rank Experts for Multi-Conditional Image Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 19:07 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: InstructMoLE · Mixture of Low-rank Experts · Diffusion Transformers · Parameter-Efficient Fine-Tuning · Multi-Conditional Image Generation · Instruction-Guided Routing · LoRA

The pith

Global routing from the full user instruction lets low-rank expert mixtures generate coherent multi-conditional images without token-level conflicts or drift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Parameter-efficient fine-tuning of diffusion transformers for tasks with several simultaneous conditions often produces interference when a single adapter such as LoRA must satisfy all constraints at once. The paper replaces per-token expert routing with a single routing decision computed from the entire instruction text; this global choice selects one consistent set of low-rank experts and applies it to every token. An output-space orthogonality loss is added to keep the selected experts functionally distinct. Experiments on challenging multi-conditional benchmarks show that the resulting model outperforms both standard LoRA adapters and earlier mixture-of-experts variants.

Core claim

InstructMoLE computes an Instruction-Guided Routing signal directly from the complete user instruction and broadcasts the same expert council to all tokens in the diffusion transformer. An accompanying output-space orthogonality loss prevents the experts from collapsing into redundant representations. The combination produces images that respect every supplied condition with less spatial fragmentation and semantic drift than token-level routing permits.
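
The abstract does not give the orthogonality loss in closed form. Purely as a hedged illustration of what an "output-space" penalty could look like, the sketch below assumes a squared pairwise-cosine penalty over the experts' outputs on the same tokens; the function, the reduction, and the lambda_orth coefficient are editorial assumptions, not the paper's definition.

import torch
import torch.nn.functional as F


def output_orthogonality_loss(expert_outputs: torch.Tensor) -> torch.Tensor:
    """expert_outputs: (num_experts, batch, seq, dim), the outputs of each
    low-rank expert on the same tokens. Returns the mean squared off-diagonal
    cosine similarity between expert output directions."""
    num_experts = expert_outputs.shape[0]
    flat = expert_outputs.reshape(num_experts, -1)    # one direction per expert
    flat = F.normalize(flat, dim=-1)
    gram = flat @ flat.T                              # (E, E) cosine similarities
    off_diag = gram - torch.eye(num_experts, device=gram.device)
    return off_diag.pow(2).sum() / (num_experts * (num_experts - 1))


def total_loss(task_loss: torch.Tensor, expert_outputs: torch.Tensor,
               lambda_orth: float = 0.1) -> torch.Tensor:
    # Assumed overall objective: generation loss plus a weighted diversity penalty.
    # lambda_orth = 0.1 is an illustrative value, not one reported by the paper.
    return task_loss + lambda_orth * output_orthogonality_loss(expert_outputs)

Driving this penalty toward zero pushes the experts' output directions apart, which is one way to operationalize the "functional diversity" the core claim relies on to avoid collapse.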

What carries the argument

Instruction-Guided Routing (IGR): a global signal derived from the full instruction that selects one expert council and applies it uniformly across all input tokens.
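
As a concreteness aid only: a minimal sketch of how one global, instruction-derived routing decision could gate a set of LoRA experts and be broadcast to every token of a frozen layer. The pooled instruction embedding, the top-k gating, the class names, and all shapes are illustrative assumptions rather than the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One low-rank adapter: x -> up(down(x)); the up projection starts at zero
    so the adapter begins as a no-op residual."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class GlobalInstructionRouter(nn.Module):
    """Maps a pooled instruction embedding to one sparse weight vector over experts."""

    def __init__(self, instr_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(instr_dim, num_experts)
        self.top_k = top_k

    def forward(self, instr_emb: torch.Tensor) -> torch.Tensor:
        logits = self.gate(instr_emb)                          # (batch, num_experts)
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)
        weights = torch.zeros_like(logits)
        weights.scatter_(-1, topk_idx, F.softmax(topk_vals, dim=-1))
        return weights


class MoLELayer(nn.Module):
    """Frozen base projection plus an instruction-routed mixture of LoRA experts."""

    def __init__(self, dim: int, instr_dim: int, num_experts: int = 4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)                        # base DiT weights stay frozen
        self.experts = nn.ModuleList([LoRAExpert(dim) for _ in range(num_experts)])
        self.router = GlobalInstructionRouter(instr_dim, num_experts)

    def forward(self, tokens: torch.Tensor, instr_emb: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim); instr_emb: (batch, instr_dim)
        weights = self.router(instr_emb)                       # one decision per instruction
        delta = torch.stack([e(tokens) for e in self.experts], dim=-1)   # (b, s, d, E)
        delta = (delta * weights[:, None, None, :]).sum(dim=-1)          # same weights for every token
        return self.base(tokens) + delta


# Toy usage: 2 samples, 16 tokens of width 64, instruction embedding of width 128.
layer = MoLELayer(dim=64, instr_dim=128)
out = layer(torch.randn(2, 16, 64), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 16, 64])

The load-bearing choice is that the routing weights depend only on the instruction embedding, never on individual tokens, so every token sees the same expert council.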

If this is right

  • Global semantics stay consistent across the entire generated image even when multiple conditions must be satisfied simultaneously.
  • Spatial fragmentation between different regions of the scene is reduced.
  • Expert representations remain diverse, limiting collapse into redundant functions.
  • Parameter-efficient fine-tuning scales to prompts that combine several distinct control signals without extra interference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same global-routing principle could be tested in language or video models where prompt-level coherence matters more than local token decisions.
  • An orthogonality loss defined in output space might generalize to other mixture-of-experts architectures to maintain specialization across modalities.
  • Instruction-level routing decisions could replace token-level ones in any transformer generator that must balance several objectives at once.

Load-bearing premise

That one routing choice based on the whole instruction remains appropriate for every local region of the image being generated.

What would settle it

A side-by-side test on the same multi-conditional prompts measuring whether per-token routing or the global routing method produces fewer artifacts and higher fidelity.

The original abstract

Parameter-Efficient Fine-Tuning of Diffusion Transformers (DiTs) for diverse, multi-conditional tasks often suffers from task interference when using monolithic adapters like LoRA. The Mixture of Low-rank Experts (MoLE) architecture offers a modular solution, but its potential is usually limited by routing policies that operate at a token level. Such local routing can conflict with the global nature of user instructions, leading to artifacts like spatial fragmentation and semantic drift in complex image generation tasks. To address these limitations, we introduce InstructMoLE, a novel framework that employs an Instruction-Guided Mixture of Low-Rank Experts. Instead of per-token routing, InstructMoLE utilizes a global routing signal, Instruction-Guided Routing (IGR), derived from the user's comprehensive instruction. This ensures that a single, coherently chosen expert council is applied uniformly across all input tokens, preserving the global semantics and structural integrity of the generation process. To complement this, we introduce an output-space orthogonality loss, which promotes expert functional diversity and mitigates representational collapse. Extensive experiments demonstrate that InstructMoLE significantly outperforms existing LoRA adapters and MoLE variants across challenging multi-conditional generation benchmarks. Our work presents a robust and generalizable framework for instruction-driven fine-tuning of generative models, enabling superior compositional control and fidelity to user intent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces InstructMoLE for parameter-efficient fine-tuning of Diffusion Transformers (DiTs) in multi-conditional image generation. It replaces token-level routing in Mixture of Low-rank Experts (MoLE) with Instruction-Guided Routing (IGR), which derives a single global expert combination from the full user instruction and applies it uniformly to all tokens. An output-space orthogonality loss is added to encourage expert diversity and prevent collapse. The central claim is that this global mechanism outperforms standard LoRA adapters and prior MoLE variants by preserving global semantics and reducing spatial fragmentation and semantic drift.

Significance. If the performance claims hold with proper controls, the work offers a coherent alternative to local routing for instruction-driven generation, potentially improving compositional fidelity in tasks with global user intent. The IGR mechanism and orthogonality loss are presented as targeted fixes for known MoLE limitations in DiT fine-tuning.

major comments (2)
  1. [Abstract] The claim that 'extensive experiments demonstrate that InstructMoLE significantly outperforms existing LoRA adapters and MoLE variants' is load-bearing for the central contribution, yet the abstract supplies no quantitative metrics, ablation tables, error bars, or specific benchmark scores, preventing assessment of whether gains are attributable to IGR or to baseline selection.
  2. [Method (IGR)] Routing mechanism (IGR description): the global routing decision is derived from the full instruction embedding and broadcast uniformly to every DiT token; for spatially heterogeneous prompts (e.g., 'red car on left, blue sky on right'), this uniform application risks averaging away necessary local low-rank adaptations, and no experiment isolates the fidelity loss relative to per-token routing.
minor comments (2)
  1. [Training objective] Clarify the precise mathematical definition of the output-space orthogonality loss, including its weighting coefficient in the total objective and how orthogonality is measured across expert outputs.
  2. [Model architecture] Provide the exact architecture details for the instruction encoder used to produce the IGR signal and how it interfaces with the DiT blocks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major point below, clarifying our design choices and proposing revisions where appropriate to strengthen the presentation of InstructMoLE.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'extensive experiments demonstrate that InstructMoLE significantly outperforms existing LoRA adapters and MoLE variants' is load-bearing for the central contribution, yet the abstract supplies no quantitative metrics, ablation tables, error bars, or specific benchmark scores, preventing assessment of whether gains are attributable to IGR or to baseline selection.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative results. In the revised manuscript, we will update the abstract to report key metrics from our experiments, including specific improvements in FID scores, CLIP similarity, and other benchmarks on multi-conditional generation tasks, along with direct comparisons to LoRA and prior MoLE variants. This will provide readers with immediate evidence of the performance gains. revision: yes

  2. Referee: [Method (IGR)] Routing mechanism (IGR description): the global routing decision is derived from the full instruction embedding and broadcast uniformly to every DiT token; for spatially heterogeneous prompts (e.g., 'red car on left, blue sky on right'), this uniform application risks averaging away necessary local low-rank adaptations, and no experiment isolates the fidelity loss relative to per-token routing.

    Authors: We acknowledge the potential concern that uniform global routing could average out local adaptations for highly spatially heterogeneous prompts. However, our motivation for IGR stems from empirical observations that per-token routing in MoLE frequently produces spatial fragmentation and semantic drift when instructions convey global intent, as demonstrated in our qualitative results and failure case analyses. To directly address the request for isolation, we will include a new ablation in the revised manuscript comparing IGR to a per-token routing variant on spatially heterogeneous prompts, reporting metrics for both global coherence and regional fidelity. revision: partial
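
To make the contrast in this exchange concrete: a hedged sketch of the per-token comparator that such an ablation would pit against global routing, using the same illustrative conventions as the routing sketch earlier on this page; nothing here is taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F


class PerTokenRouter(nn.Module):
    """Conventional MoLE-style routing: one expert-weight vector per token."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim) -> weights: (batch, seq, num_experts)
        return F.softmax(self.gate(tokens), dim=-1)


# For 2 images with 16 tokens and 4 experts, global routing yields a (2, 4)
# weight matrix shared by all tokens, while this variant yields (2, 16, 4)
# and can therefore change experts from region to region.
router = PerTokenRouter(dim=64, num_experts=4)
print(router(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 4])

The only difference that matters for the ablation is the router's input: token features here, a pooled instruction embedding in the global variant, so the mixture can vary across the image and, on the paper's account, fragment it.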

Circularity Check

0 steps flagged

No circularity: derivation chain is self-contained

Full rationale

The paper defines Instruction-Guided Routing (IGR) as a global routing signal extracted from the full user instruction and an output-space orthogonality loss to encourage expert diversity. These components are introduced as architectural responses to the stated limitations of token-level routing in prior MoLE variants. Performance claims rest on experimental comparisons against LoRA and MoLE baselines on multi-conditional benchmarks rather than any equation that reduces the reported gains to a fitted parameter, self-referential definition, or self-citation chain. No load-bearing uniqueness theorems, ansatzes imported via author citations, or renamings of known results appear in the provided text. The central claims therefore remain independent of the inputs they are evaluated against.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The framework introduces two new mechanisms (global instruction-guided routing and output-space orthogonality loss) whose functional forms are not derived from prior literature; no explicit free parameters or background axioms beyond standard neural-network training are stated in the abstract.

invented entities (2)
  • Instruction-Guided Routing (IGR) · no independent evidence
    purpose: Derive a single global expert-selection signal from the complete user instruction instead of per-token decisions
    New routing policy introduced to address spatial fragmentation and semantic drift
  • output-space orthogonality loss · no independent evidence
    purpose: Encourage functional diversity among the low-rank experts to prevent representational collapse
    New auxiliary loss term proposed to complement the routing change

pith-pipeline@v0.9.0 · 5569 in / 1282 out tokens · 26173 ms · 2026-05-16T19:07:27.995205+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What these tags mean
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 6 internal anchors
