LoRA-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing
Pith reviewed 2026-05-19 09:02 UTC · model grok-4.3
The pith
LoRA-Mixer routes task-specific LoRA experts through attention input and output layers to deliver token-level specialization in LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LoRA-Mixer coordinates modular LoRA experts through serial attention routing by inserting them into the input and output linear layers of the attention module, employing an adaptive Routing Specialization Loss to enforce global balance and input-aware specialization via entropy shaping. The framework supports joint optimization with differentiable hard-soft top-k routing or plug-and-play routing over frozen pre-trained LoRAs, and across 15 benchmarks it outperforms state-of-the-art routing and LoRA-MoE baselines while using 48 percent of their trainable parameters, with gains of 3.79, 2.90, and 3.95 percentage points on GSM8K, CoLA, and ARC-C respectively.
What carries the argument
Serial attention routing of LoRA experts inserted into the attention module's input and output linear layers, which exploits the attention mechanism itself to achieve fine-grained token-level specialization.
If this is right
- The approach outperforms prior methods on 15 benchmarks including MedQA, GSM8K, HumanEval, and GLUE while using only 48 percent of the trainable parameters.
- It supports both joint optimization of adapters and router as well as plug-and-play routing over frozen pre-trained LoRA modules.
- Cross-model transfer and adapter reuse experiments show versatility and data efficiency.
- The design remains drop-in compatible with Transformers and state-space models because it targets ubiquitous linear projection layers.
Where Pith is reading between the lines
- The attention-focused placement may extend to other architectures where linear projections dominate, such as certain vision or multimodal models.
- Fewer parameters per task could allow scaling to larger numbers of tasks or experts under the same compute budget.
- The serial routing idea might combine with other adaptation methods to further reduce interference between tasks.
Load-bearing premise
That inserting LoRA experts specifically into the input and output linear layers of the attention module rather than FFN blocks produces fine-grained token-level specialization while keeping the method compatible with Transformers and state-space models.
What would settle it
A controlled experiment that places the same LoRA experts into FFN blocks instead of attention layers and measures whether the reported performance gains and parameter reduction on GSM8K and ARC-C disappear.
Figures
read the original abstract
Recent attempts to combine low-rank adaptation (LoRA) with mixture-of-experts (MoE) for multi-task adaptation of Large Language Models (LLMs) often replace whole attention/FFN layers with switch experts or append parallel expert branches, undermining parameter efficiency and limiting task specialization. We introduce LoRA-Mixer, a modular MoE framework that routes task-specific LoRA experts into the core projection matrices of the attention module, namely input and output linear layers, rather than primarily targeting FFN blocks. The design delivers fine-grained token-level specialization by fully exploiting the attention mechanism, while remaining drop-in compatible with Transformers and state-space models (SSMs), since linear projection layers are ubiquitous. To train robust routers from limited data while promoting stable, selective decisions and high expert reuse, LoRA-Mixer employs an adaptive Routing Specialization Loss (RSL) that jointly enforces global load balance and input-aware specialization via an entropy-shaping objective. The framework supports two regimes: (i) joint optimization of adapters and router with a differentiable hard-soft top-k routing scheme, and (ii) plug-and-play routing over frozen, pre-trained LoRA modules sourced from public repositories. Across 15 benchmarks, including MedQA, GSM8K, HumanEval, and GLUE, RSL-optimized LoRA-Mixer outperforms state-of-the-art routing and LoRA-MoE baselines while using 48 percent of their trainable parameters, with gains of 3.79, 2.90, and 3.95 percentage points on GSM8K, CoLA, and ARC-C, respectively. Cross-model transfer and adapter reuse experiments further demonstrate the approach's versatility and data efficiency. Our code is available at https://github.com/hustcselwb/LoRA-Mixer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes LoRA-Mixer, a modular MoE framework that routes task-specific LoRA experts into the input and output linear layers of the attention module (rather than primarily FFN blocks) for fine-grained token-level specialization in Transformers and state-space models. It introduces an adaptive Routing Specialization Loss (RSL) that enforces global load balance and input-aware specialization via entropy shaping, supporting both joint optimization with differentiable hard-soft top-k routing and plug-and-play use of frozen pre-trained adapters. Across 15 benchmarks the RSL-optimized model is reported to outperform routing and LoRA-MoE baselines while using 48% of their trainable parameters, with concrete gains of 3.79, 2.90, and 3.95 percentage points on GSM8K, CoLA, and ARC-C respectively; cross-model transfer and adapter-reuse experiments are also presented.
Significance. If the empirical claims are substantiated, the work would offer a practical advance in parameter-efficient multi-task adaptation by showing that attention-layer LoRA placement plus a specialized routing objective can improve both performance and efficiency over prior LoRA-MoE designs while remaining compatible with existing model families. The dual support for joint training and reuse of public adapters, together with the explicit code release, would strengthen its utility for data-efficient and modular fine-tuning scenarios.
major comments (2)
- [Abstract] Abstract: the headline claim that LoRA-Mixer uses 48% of the trainable parameters of the compared baselines is load-bearing for the efficiency argument, yet no breakdown is supplied (expert count, LoRA rank r, number of routed layers, or exact baseline configurations). Without these counts it is impossible to determine whether the reported savings arise from the serial attention routing and RSL objective or simply from deploying fewer adapters overall.
- [Abstract] Abstract / results section: the reported gains (3.79 pp on GSM8K, 2.90 pp on CoLA, 3.95 pp on ARC-C) and the overall outperformance across 15 benchmarks are presented without statistical significance, standard deviations across runs, or detailed descriptions of baseline implementations and data splits. These omissions undermine the robustness of the central performance claim.
minor comments (2)
- [Introduction] The two training regimes (joint optimization versus plug-and-play routing) are introduced in the abstract but would benefit from an explicit side-by-side comparison table or diagram early in the manuscript to clarify their respective hyper-parameter settings and data requirements.
- [Method] Notation for the RSL objective and the hard-soft top-k routing scheme should be introduced with a single consolidated equation block rather than scattered references, to improve readability for readers unfamiliar with entropy-shaping losses.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline claim that LoRA-Mixer uses 48% of the trainable parameters of the compared baselines is load-bearing for the efficiency argument, yet no breakdown is supplied (expert count, LoRA rank r, number of routed layers, or exact baseline configurations). Without these counts it is impossible to determine whether the reported savings arise from the serial attention routing and RSL objective or simply from deploying fewer adapters overall.
Authors: We agree that an explicit parameter breakdown is required to support the efficiency claim. The 48% figure is obtained by comparing our configuration (serial routing of 4-8 LoRA experts with rank r=16 into attention input/output projections across selected layers) against the baselines' higher expert counts and ranks applied primarily to FFN blocks. In the revised manuscript we will add a dedicated table in Section 4.1 (Experimental Setup) that lists expert counts, LoRA ranks, number of routed layers, total trainable parameters for LoRA-Mixer and each baseline, and the precise hyper-parameter settings used for the baselines. This addition will make clear that the reported savings derive from the attention-layer placement and serial routing rather than from simply using fewer adapters overall. revision: yes
-
Referee: [Abstract] Abstract / results section: the reported gains (3.79 pp on GSM8K, 2.90 pp on CoLA, 3.95 pp on ARC-C) and the overall outperformance across 15 benchmarks are presented without statistical significance, standard deviations across runs, or detailed descriptions of baseline implementations and data splits. These omissions undermine the robustness of the central performance claim.
Authors: We acknowledge that the current presentation lacks explicit statistical support. Although the full experimental section reports results averaged over multiple random seeds for the primary benchmarks, we did not include per-run standard deviations, p-values, or exhaustive baseline implementation details in the abstract or main tables. In the revision we will (i) add standard deviations and 95% confidence intervals to all reported metrics in Tables 2-4, (ii) include paired t-test p-values for the highlighted gains on GSM8K, CoLA, and ARC-C, and (iii) expand the appendix with complete baseline hyper-parameters, data-split specifications, and training protocols. These changes will be reflected in both the abstract and the results section. revision: yes
Circularity Check
No circularity: empirical method with benchmark validation
full rationale
The paper introduces LoRA-Mixer as an architectural design placing LoRA experts in attention input/output projections, paired with an RSL training objective for routing balance and specialization. All performance claims (outperformance on 15 benchmarks, 48% parameter usage) are presented as empirical outcomes from experiments rather than any first-principles derivation, prediction, or uniqueness theorem. No equations reduce a claimed result to a fitted input by construction, and no self-citation chain bears the central load; the RSL loss is defined and then validated through results. The derivation chain is therefore self-contained as a proposal plus empirical evidence.
Axiom & Free-Parameter Ledger
free parameters (1)
- top-k routing hyperparameters
axioms (1)
- domain assumption Linear projection layers are ubiquitous in Transformers and state-space models
invented entities (1)
-
Routing Specialization Loss (RSL)
no independent evidence
Forward citations
Cited by 6 Pith papers
-
IntervenSim: Intervention-Aware Social Network Simulation for Opinion Dynamics
IntervenSim is an intervention-aware social network simulation that couples source interventions with crowd interactions in a feedback loop, improving MAPE by 41.6% and DTW by 66.9% over prior static frameworks on rea...
-
GateMOT: Q-Gated Attention for Dense Object Tracking
GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object tracking on benchmarks like BEE24.
-
OmniTrend: Content-Context Modeling for Scalable Social Popularity Prediction
OmniTrend predicts popularity by combining separate content attractiveness and contextual exposure predictors using cross-modal and exogenous signals.
-
HotComment: A Benchmark for Evaluating Popularity of Online Comments
HotComment is a new multimodal benchmark that quantifies online comment popularity via content quality assessment, interaction-based prediction, and agent-simulated user engagement, accompanied by the StyleCmt stylist...
-
Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction
A new joint spatio-temporal enlargement model for micro-video popularity prediction using frame scoring for long sequences and a topology-aware memory bank for unbounded historical associations.
-
CurEvo: Curriculum-Guided Self-Evolution for Video Understanding
CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.
Reference graph
Works this paper leans on
-
[1]
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901, 2020
work page 1901
-
[2]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timo- thée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer.arXiv preprint arXiv:1701.06538, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[4]
Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models.ICLR, 1(2):3, 2022
work page 2022
-
[5]
Chunlin Tian, Zhan Shi, Zhijiang Guo, Li Li, and Cheng-Zhong Xu. Hydralora: An asymmetric lora architecture for efficient fine-tuning.Advances in Neural Information Processing Systems, 37:9565–9584, 2024
work page 2024
-
[6]
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity.Journal of Machine Learning Research, 23(120):1–39, 2022
work page 2022
-
[7]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
Dora: Weight-decomposed low-rank adaptation
Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. Dora: Weight-decomposed low-rank adaptation. 2024
work page 2024
-
[9]
AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. Adalora: Adaptive budget allocation for parameter- efficient fine-tuning.arXiv preprint arXiv:2303.10512, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[10]
Delta-lora: Fine-tuning high-rank parameters with the delta of low-rank matrices,
Bojia Zi, Xianbiao Qi, Lingzhi Wang, Jianan Wang, Kam-Fai Wong, and Lei Zhang. Delta- lora: Fine-tuning high-rank parameters with the delta of low-rank matrices.arXiv preprint arXiv:2309.02411, 2023
-
[11]
Lora-drop: Efficient lora parameter pruning based on output evaluation,
Hongyun Zhou, Xiangyu Lu, Wang Xu, Conghui Zhu, Tiejun Zhao, and Muyun Yang. Lora-drop: Efficient lora parameter pruning based on output evaluation.arXiv preprint arXiv:2402.07721, 2024
-
[12]
Lora+: Efficient low rank adaptation of large models,
Soufiane Hayou, Nikhil Ghosh, and Bin Yu. Lora+: Efficient low rank adaptation of large models.arXiv preprint arXiv:2402.12354, 2024
-
[13]
Yang, Maxime Robeyns, Xi Wang, and Laurence Aitchison
Dengchun Li, Yingzi Ma, Naizheng Wang, Zhengmao Ye, Zhiyuan Cheng, Yinghao Tang, Yan Zhang, Lei Duan, Jie Zuo, Cal Yang, et al. Mixlora: Enhancing large language models fine-tuning with lora-based mixture of experts.arXiv preprint arXiv:2404.15159, 2024
-
[14]
Mixture of lora experts.arXiv preprint arXiv:2404.13628, 2024a
Xun Wu, Shaohan Huang, and Furu Wei. Mixture of lora experts.arXiv preprint arXiv:2404.13628, 2024
- [15]
-
[16]
Ziyu Zhao, Tao Shen, Didi Zhu, Zexi Li, Jing Su, Xuwu Wang, Kun Kuang, and Fei Wu. Merging loras like playing lego: Pushing the modularity of lora to extremes through rank-wise clustering.arXiv preprint arXiv:2409.16167, 2024. 10
-
[17]
Chenghao Fan, Zhenyi Lu, Sichen Liu, Xiaoye Qu, Wei Wei, Chengfeng Gu, and Yu Cheng. Make lora great again: Boosting lora with adaptive singular values and mixture-of-experts optimization alignment.arXiv preprint arXiv:2502.16894, 2025
-
[18]
Dynmole: Boosting mixture of lora experts fine-tuning with a hybrid routing mechanism
Dengchun Li, Naizheng Wang, Zihao Zhang, Haoyang Yin, Lei Duan, Meng Xiao, and Mingjie Tang. Dynmole: Boosting mixture of lora experts fine-tuning with a hybrid routing mechanism. arXiv preprint arXiv:2504.00661, 2025
-
[19]
Zhanbo Huang, Xiaoming Liu, and Yu Kong. H-more: Learning human-centric motion representation for action analysis.arXiv preprint arXiv:2504.10676, 2025
-
[20]
Glam: Efficient scaling of language models with mixture-of-experts
Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. pages 5547–5569, 2022
work page 2022
-
[21]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[22]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
Shaoxiang Chen, Zequn Jie, and Lin Ma. Llava-mole: Sparse mixture of lora experts for mitigating data conflicts in instruction finetuning mllms.arXiv preprint arXiv:2401.16160, 2024
-
[24]
Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Jun Zhao, Wei Shen, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Xiaoran Fan, et al. Loramoe: Alleviate world knowledge forgetting in large language models via moe-style plugin.arXiv preprint arXiv:2312.09979, 2023
-
[25]
Ziheng Ouyang, Zhen Li, and Qibin Hou. K-lora: Unlocking training-free fusion of any subject and style loras.arXiv preprint arXiv:2502.18461, 2025
-
[26]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b. 2023
work page 2023
-
[28]
Falcon mamba: The first competitive attention-free 7b language model
Jingwei Zuo, Maksim Velikanov, Dhia Eddine Rhaiem, Ilyas Chahed, Younes Belkada, Guil- laume Kunsch, and Hakim Hacid. Falcon mamba: The first competitive attention-free 7b language model. 2024
work page 2024
-
[29]
Training verifiers to solve math word problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. 2021
work page 2021
-
[30]
Think you have solved question answering? try arc, the ai2 reasoning challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. 2018
work page 2018
-
[31]
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. 2019
work page 2019
-
[32]
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...
work page 2021
-
[33]
Llama 2: Open foundation and fine-tuned chat models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Harts...
work page 2023
-
[34]
Higher layers need more lora experts.arXiv preprint arXiv:2402.08562,
Chongyang Gao, Kezhen Chen, Jinmeng Rao, Baochen Sun, Ruibo Liu, Daiyi Peng, Yawen Zhang, Xiaoyuan Guo, Jie Yang, and VS Subrahmanian. Higher layers need more lora experts. arXiv preprint arXiv:2402.08562, 2024
-
[35]
Hmora: Making llms more effective with hierarchical mixture of lora experts
Mengqi Liao, Wei Chen, Junfeng Shen, Shengnan Guo, and Huaiyu Wan. Hmora: Making llms more effective with hierarchical mixture of lora experts. InThe Thirteenth International Conference on Learning Representations, 2025
work page 2025
-
[36]
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models.Journal of Machine Learning Research, 25(70):1–53, 2024
work page 2024
-
[37]
Prefix-tuning: Optimizing continuous prompts for generation
Xiao Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (V olume 1: Long Papers), pages 3458–3470, 2021
work page 2021
-
[39]
Bitfit: Simple parameter-efficient fine-tuning for transformer-based language models
Elad Zaken, Shauli Ravfogel, Yoav Lang, Ran El-Yaniv, and Naftali Tishby. Bitfit: Simple parameter-efficient fine-tuning for transformer-based language models. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1164–1174, 2021
work page 2021
-
[40]
Xiao Liu, Yanan Zeng, Zheng Liu, Xiao Ding, Yujie Du, Jie Huang, Yixin Nie, Jilan Zhang, Zhiyuan Zhou, Chang Ren, et al. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks.arXiv preprint arXiv:2303.03417, 2023
-
[41]
Parameter-efficient transfer learning for nlp
Neil Houlsby, Sebastian Jastrzebski, Andrzej Brooks, Rosanne de Vries, Andrea Guedj, and Grégory Nematzadeh. Parameter-efficient transfer learning for nlp. InProceedings of the 36th International Conference on Machine Learning, volume 97, pages 2791–2800, 2019
work page 2019
-
[42]
Xiao Liu, Kaixuan Peng, Zheng Zhao, Ying Song, Xinyu Tan, Chen Wang, Ming Lyu, Weinan Zhou, Jin Yang, Jianlin Su, et al. Gpt understands, too. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (V olume 1: Long Papers), pages 1016–1024, 2021
work page 2021
-
[43]
M2e: Multi-granular mixture of experts for neural machine translation
Xue Zhang, Boxing Zhao, Li Feng, Bo Zhou, and Xu Yu. M2e: Multi-granular mixture of experts for neural machine translation. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2127–2137, 2018. 12
work page 2018
-
[44]
Oleksiy Ostapenko, Zhan Su, Edoardo Maria Ponti, Laurent Charlin, Nicolas Le Roux, Matheus Pereira, Lucas Caccia, and Alessandro Sordoni. Towards modular llms by building and reusing a library of loras.arXiv preprint arXiv:2405.11157, 2024
-
[45]
Moral: Moe augmented lora for llms’ lifelong learning.arXiv preprint arXiv: 2402.11260, 2024b
Shu Yang, Muhammad Asif Ali, Cheng-Long Wang, Lijie Hu, and Di Wang. Moral: Moe augmented lora for llms’ lifelong learning.arXiv preprint arXiv:2402.11260, 2024
-
[46]
Octavius: Mitigating task interference in mllms via lora-moe
Zeren Chen, Ziqin Wang, Zhen Wang, Huayang Liu, Zhenfei Yin, Si Liu, Lu Sheng, Wanli Ouyang, Yu Qiao, and Jing Shao. Octavius: Mitigating task interference in mllms via lora-moe. arXiv preprint arXiv:2311.02684, 2023
-
[47]
Jingwei Xu, Junyu Lai, and Yunpeng Huang. Meteora: Multiple-tasks embedded lora for large language models.arXiv preprint arXiv:2405.13053, 2024
-
[48]
Tongxu Luo, Jiahe Lei, Fangyu Lei, Weihao Liu, Shizhu He, Jun Zhao, and Kang Liu. Moelora: Contrastive learning guided mixture of experts on parameter-efficient fine-tuning for large language models.arXiv preprint arXiv:2402.12851, 2024
-
[49]
S-lora: Serving thousands of concurrent lora adapters.arXiv preprint arXiv:2311.03285, 2023
Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang, Christopher Chou, Banghua Zhu, Lianmin Zheng, Kurt Keutzer, et al. S-lora: Serving thousands of concurrent lora adapters.arXiv preprint arXiv:2311.03285, 2023
-
[50]
Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. Lora learns less and forgets less.arXiv preprint arXiv:2405.09673, 2024
-
[51]
Loftq: Lora- fine-tuning-aware quantization for large language models
Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, and Tuo Zhao. Loftq: Lora-fine-tuning-aware quantization for large language models.arXiv preprint arXiv:2310.08659, 2023
-
[52]
arXiv preprint arXiv:2402.07871 , year=
Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygó´ zd´ z, Piotr Sankowski, et al. Scaling laws for fine-grained mixture of experts.arXiv preprint arXiv:2402.07871, 2024
-
[53]
Learning a mixture of granularity- specific experts for fine-grained categorization
Lianbo Zhang, Shaoli Huang, Wei Liu, and Dacheng Tao. Learning a mixture of granularity- specific experts for fine-grained categorization. InProceedings of the IEEE/CVF international conference on computer vision, pages 8331–8340, 2019
work page 2019
-
[54]
Sparse mixture-of-experts are domain generalizable learners.arXiv preprint arXiv:2206.04046, 2022
Bo Li, Yifei Shen, Jingkang Yang, Yezhen Wang, Jiawei Ren, Tong Che, Jun Zhang, and Ziwei Liu. Sparse mixture-of-experts are domain generalizable learners.arXiv preprint arXiv:2206.04046, 2022
-
[55]
Hard mixtures of experts for large scale weakly supervised vision
Sam Gross, Marc’Aurelio Ranzato, and Arthur Szlam. Hard mixtures of experts for large scale weakly supervised vision. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6865–6873, 2017
work page 2017
-
[56]
Universal language model fine-tuning for text classi- fication
Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classi- fication. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), Melbourne, Australia, July 2018. Association for Compu- tational Linguistics
work page 2018
-
[57]
Adapterfusion: Non-destructive task composition for transfer learning
Neil Houlsby, Andrei Giurgiu, Stanisław Jastrzebski, Bruna Morrone, Quentin De Vries, Jack W Rae, Stephen King, and Sebastian Ruder. Adapterfusion: Non-destructive task composition for transfer learning. InAdvances in Neural Information Processing Systems, volume 32, pages 6649–6659, 2019
work page 2019
-
[58]
MAD-X: An adapter-based framework for multi-task cross-lingual transfer
Jonas Pfeiffer, Andreas Rücklé, Christian Poth, Aishwarya Anil, Ivan Texier, Sebastian Michael, and Iryna Gurevych. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. InInternational Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 7430–7439. PMLR, 2020
work page 2020
-
[59]
Parameter-efficient transfer learning with transformers
Neil Houlsby, Andrei Giurgiu, Stanisław Jastrzebski, Bruna Morrone, Quentin De Vries, Andrea Waldon, and Stephen King. Parameter-efficient transfer learning with transformers. In International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 2791–2800. PMLR, 2019. 13
work page 2019
-
[60]
Coupled mamba: Enhanced multi-modal fusion with coupled state space model
Wenbing Li, Hang Zhou, Junqing Yu, Zikai Song, and Wei Yang. Coupled mamba: Enhanced multi-modal fusion with coupled state space model.arXiv preprint arXiv:2405.18014, 2024
-
[61]
Dmitry Lepikhin, Hyoukjun Mehdad, Mostafa Shen, Tao Xu, Yanping Chen, Dmitry Krikun, and Minh-Thang Luong. Gshard: Scaling giant models with conditional computation and automatic sharding. InInternational Conference on Learning Representations, 2021. 14 A Experiment Result Table 9: Comparison of LoRA-Mixer on Falcon-Mamba, Mistral, and LLaMA across seven ...
-
[62]
Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.