pith. machine review for the scientific record.

arxiv: 2512.21788 · v3 · submitted 2025-12-25 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

InstructMoLE: Instruction-Guided Mixture of Low-rank Experts for Multi-Conditional Image Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 19:07 UTC · model grok-4.3

classification: 💻 cs.CV · cs.AI
keywords: InstructMoLE · Mixture of Low-rank Experts · Diffusion Transformers · Parameter-Efficient Fine-Tuning · Multi-Conditional Image Generation · Instruction-Guided Routing · LoRA

The pith

Global routing from the full user instruction lets low-rank expert mixtures generate coherent multi-conditional images without token-level conflicts or drift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Parameter-efficient fine-tuning of diffusion transformers for tasks with several simultaneous conditions often produces interference when a single adapter such as LoRA must satisfy all constraints at once. The paper replaces per-token expert routing with a single routing decision computed from the entire instruction text; this global choice selects one consistent set of low-rank experts and applies it to every token. An output-space orthogonality loss is added to keep the selected experts functionally distinct. Experiments on challenging multi-conditional benchmarks show that the resulting model outperforms both standard LoRA adapters and earlier mixture-of-experts variants.

Core claim

InstructMoLE computes an Instruction-Guided Routing signal directly from the complete user instruction and broadcasts the same expert council to all tokens in the diffusion transformer. An accompanying output-space orthogonality loss prevents the experts from collapsing into redundant representations. The combination produces images that respect every supplied condition with less spatial fragmentation and semantic drift than token-level routing permits.
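
The abstract does not give the orthogonality loss in closed form. Purely as a hedged illustration of what an "output-space" penalty could look like, the sketch below assumes a squared pairwise-cosine penalty over the experts' outputs on the same tokens; the function, the reduction, and the lambda_orth coefficient are editorial assumptions, not the paper's definition.

import torch
import torch.nn.functional as F


def output_orthogonality_loss(expert_outputs: torch.Tensor) -> torch.Tensor:
    """expert_outputs: (num_experts, batch, seq, dim), the outputs of each
    low-rank expert on the same tokens. Returns the mean squared off-diagonal
    cosine similarity between expert output directions."""
    num_experts = expert_outputs.shape[0]
    flat = expert_outputs.reshape(num_experts, -1)    # one direction per expert
    flat = F.normalize(flat, dim=-1)
    gram = flat @ flat.T                              # (E, E) cosine similarities
    off_diag = gram - torch.eye(num_experts, device=gram.device)
    return off_diag.pow(2).sum() / (num_experts * (num_experts - 1))


def total_loss(task_loss: torch.Tensor, expert_outputs: torch.Tensor,
               lambda_orth: float = 0.1) -> torch.Tensor:
    # Assumed overall objective: generation loss plus a weighted diversity penalty.
    # lambda_orth = 0.1 is an illustrative value, not one reported by the paper.
    return task_loss + lambda_orth * output_orthogonality_loss(expert_outputs)

Driving this penalty toward zero pushes the experts' output directions apart, which is one way to operationalize the "functional diversity" the core claim relies on to avoid collapse.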

What carries the argument

Instruction-Guided Routing (IGR): a global signal derived from the full instruction that selects one expert council and applies it uniformly across all input tokens.
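
As a concreteness aid only: a minimal sketch of how one global, instruction-derived routing decision could gate a set of LoRA experts and be broadcast to every token of a frozen layer. The pooled instruction embedding, the top-k gating, the class names, and all shapes are illustrative assumptions rather than the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LoRAExpert(nn.Module):
    """One low-rank adapter: x -> up(down(x)); the up projection starts at zero
    so the adapter begins as a no-op residual."""

    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(self.down(x))


class GlobalInstructionRouter(nn.Module):
    """Maps a pooled instruction embedding to one sparse weight vector over experts."""

    def __init__(self, instr_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(instr_dim, num_experts)
        self.top_k = top_k

    def forward(self, instr_emb: torch.Tensor) -> torch.Tensor:
        logits = self.gate(instr_emb)                          # (batch, num_experts)
        topk_vals, topk_idx = logits.topk(self.top_k, dim=-1)
        weights = torch.zeros_like(logits)
        weights.scatter_(-1, topk_idx, F.softmax(topk_vals, dim=-1))
        return weights


class MoLELayer(nn.Module):
    """Frozen base projection plus an instruction-routed mixture of LoRA experts."""

    def __init__(self, dim: int, instr_dim: int, num_experts: int = 4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.requires_grad_(False)                        # base DiT weights stay frozen
        self.experts = nn.ModuleList([LoRAExpert(dim) for _ in range(num_experts)])
        self.router = GlobalInstructionRouter(instr_dim, num_experts)

    def forward(self, tokens: torch.Tensor, instr_emb: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim); instr_emb: (batch, instr_dim)
        weights = self.router(instr_emb)                       # one decision per instruction
        delta = torch.stack([e(tokens) for e in self.experts], dim=-1)   # (b, s, d, E)
        delta = (delta * weights[:, None, None, :]).sum(dim=-1)          # same weights for every token
        return self.base(tokens) + delta


# Toy usage: 2 samples, 16 tokens of width 64, instruction embedding of width 128.
layer = MoLELayer(dim=64, instr_dim=128)
out = layer(torch.randn(2, 16, 64), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 16, 64])

The load-bearing choice is that the routing weights depend only on the instruction embedding, never on individual tokens, so every token sees the same expert council.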

If this is right

  • Global semantics stay consistent across the entire generated image even when multiple conditions must be satisfied simultaneously.
  • Spatial fragmentation between different regions of the scene is reduced.
  • Expert representations remain diverse, limiting collapse into redundant functions.
  • Parameter-efficient fine-tuning scales to prompts that combine several distinct control signals without extra interference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same global-routing principle could be tested in language or video models where prompt-level coherence matters more than local token decisions.
  • An orthogonality loss defined in output space might generalize to other mixture-of-experts architectures to maintain specialization across modalities.
  • Instruction-level routing decisions could replace token-level ones in any transformer generator that must balance several objectives at once.

Load-bearing premise

That one routing choice based on the whole instruction remains appropriate for every local region of the image being generated.

What would settle it

A side-by-side test on the same multi-conditional prompts measuring whether per-token routing or the global routing method produces fewer artifacts and higher fidelity.

The original abstract

Parameter-Efficient Fine-Tuning of Diffusion Transformers (DiTs) for diverse, multi-conditional tasks often suffers from task interference when using monolithic adapters like LoRA. The Mixture of Low-rank Experts (MoLE) architecture offers a modular solution, but its potential is usually limited by routing policies that operate at a token level. Such local routing can conflict with the global nature of user instructions, leading to artifacts like spatial fragmentation and semantic drift in complex image generation tasks. To address these limitations, we introduce InstructMoLE, a novel framework that employs an Instruction-Guided Mixture of Low-Rank Experts. Instead of per-token routing, InstructMoLE utilizes a global routing signal, Instruction-Guided Routing (IGR), derived from the user's comprehensive instruction. This ensures that a single, coherently chosen expert council is applied uniformly across all input tokens, preserving the global semantics and structural integrity of the generation process. To complement this, we introduce an output-space orthogonality loss, which promotes expert functional diversity and mitigates representational collapse. Extensive experiments demonstrate that InstructMoLE significantly outperforms existing LoRA adapters and MoLE variants across challenging multi-conditional generation benchmarks. Our work presents a robust and generalizable framework for instruction-driven fine-tuning of generative models, enabling superior compositional control and fidelity to user intent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces InstructMoLE for parameter-efficient fine-tuning of Diffusion Transformers (DiTs) in multi-conditional image generation. It replaces token-level routing in Mixture of Low-rank Experts (MoLE) with Instruction-Guided Routing (IGR), which derives a single global expert combination from the full user instruction and applies it uniformly to all tokens. An output-space orthogonality loss is added to encourage expert diversity and prevent collapse. The central claim is that this global mechanism outperforms standard LoRA adapters and prior MoLE variants by preserving global semantics and reducing spatial fragmentation and semantic drift.

Significance. If the performance claims hold with proper controls, the work offers a coherent alternative to local routing for instruction-driven generation, potentially improving compositional fidelity in tasks with global user intent. The IGR mechanism and orthogonality loss are presented as targeted fixes for known MoLE limitations in DiT fine-tuning.

major comments (2)
  1. [Abstract] The claim that 'extensive experiments demonstrate that InstructMoLE significantly outperforms existing LoRA adapters and MoLE variants' is load-bearing for the central contribution, yet the abstract supplies no quantitative metrics, ablation tables, error bars, or specific benchmark scores, preventing assessment of whether gains are attributable to IGR or to baseline selection.
  2. [Method (IGR)] Routing mechanism (IGR description): the global routing decision is derived from the full instruction embedding and broadcast uniformly to every DiT token; for spatially heterogeneous prompts (e.g., 'red car on left, blue sky on right'), this uniform application risks averaging away necessary local low-rank adaptations, and no experiment isolates the fidelity loss relative to per-token routing.
minor comments (2)
  1. [Training objective] Clarify the precise mathematical definition of the output-space orthogonality loss, including its weighting coefficient in the total objective and how orthogonality is measured across expert outputs.
  2. [Model architecture] Provide the exact architecture details for the instruction encoder used to produce the IGR signal and how it interfaces with the DiT blocks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major point below, clarifying our design choices and proposing revisions where appropriate to strengthen the presentation of InstructMoLE.

Point-by-point responses
  1. Referee: [Abstract] The claim that 'extensive experiments demonstrate that InstructMoLE significantly outperforms existing LoRA adapters and MoLE variants' is load-bearing for the central contribution, yet the abstract supplies no quantitative metrics, ablation tables, error bars, or specific benchmark scores, preventing assessment of whether gains are attributable to IGR or to baseline selection.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative results. In the revised manuscript, we will update the abstract to report key metrics from our experiments, including specific improvements in FID scores, CLIP similarity, and other benchmarks on multi-conditional generation tasks, along with direct comparisons to LoRA and prior MoLE variants. This will provide readers with immediate evidence of the performance gains. revision: yes

  2. Referee: [Method (IGR)] Routing mechanism (IGR description): the global routing decision is derived from the full instruction embedding and broadcast uniformly to every DiT token; for spatially heterogeneous prompts (e.g., 'red car on left, blue sky on right'), this uniform application risks averaging away necessary local low-rank adaptations, and no experiment isolates the fidelity loss relative to per-token routing.

    Authors: We acknowledge the potential concern that uniform global routing could average out local adaptations for highly spatially heterogeneous prompts. However, our motivation for IGR stems from empirical observations that per-token routing in MoLE frequently produces spatial fragmentation and semantic drift when instructions convey global intent, as demonstrated in our qualitative results and failure case analyses. To directly address the request for isolation, we will include a new ablation in the revised manuscript comparing IGR to a per-token routing variant on spatially heterogeneous prompts, reporting metrics for both global coherence and regional fidelity. revision: partial
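
To make the contrast in this exchange concrete: a hedged sketch of the per-token comparator that such an ablation would pit against global routing, using the same illustrative conventions as the routing sketch earlier on this page; nothing here is taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F


class PerTokenRouter(nn.Module):
    """Conventional MoLE-style routing: one expert-weight vector per token."""

    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq, dim) -> weights: (batch, seq, num_experts)
        return F.softmax(self.gate(tokens), dim=-1)


# For 2 images with 16 tokens and 4 experts, global routing yields a (2, 4)
# weight matrix shared by all tokens, while this variant yields (2, 16, 4)
# and can therefore change experts from region to region.
router = PerTokenRouter(dim=64, num_experts=4)
print(router(torch.randn(2, 16, 64)).shape)  # torch.Size([2, 16, 4])

The only difference that matters for the ablation is the router's input: token features here, a pooled instruction embedding in the global variant, so the mixture can vary across the image and, on the paper's account, fragment it.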

Circularity Check

0 steps flagged

No circularity: derivation chain is self-contained

Full rationale

The paper defines Instruction-Guided Routing (IGR) as a global routing signal extracted from the full user instruction and an output-space orthogonality loss to encourage expert diversity. These components are introduced as architectural responses to the stated limitations of token-level routing in prior MoLE variants. Performance claims rest on experimental comparisons against LoRA and MoLE baselines on multi-conditional benchmarks rather than any equation that reduces the reported gains to a fitted parameter, self-referential definition, or self-citation chain. No load-bearing uniqueness theorems, ansatzes imported via author citations, or renamings of known results appear in the provided text. The central claims therefore remain independent of the inputs they are evaluated against.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The framework introduces two new mechanisms (global instruction-guided routing and output-space orthogonality loss) whose functional forms are not derived from prior literature; no explicit free parameters or background axioms beyond standard neural-network training are stated in the abstract.

invented entities (2)
  • Instruction-Guided Routing (IGR) · no independent evidence
    purpose: Derive a single global expert-selection signal from the complete user instruction instead of per-token decisions
    New routing policy introduced to address spatial fragmentation and semantic drift
  • output-space orthogonality loss · no independent evidence
    purpose: Encourage functional diversity among the low-rank experts to prevent representational collapse
    New auxiliary loss term proposed to complement the routing change

pith-pipeline@v0.9.0 · 5569 in / 1282 out tokens · 26173 ms · 2026-05-16T19:07:27.995205+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What these tags mean
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 6 internal anchors
