LoRA-Mixer: Coordinate Modular LoRA Experts Through Serial Attention Routing

Hang Zhou; Junqing Yu; Wei Yang; Wenbing Li; Yunyao Zhang; Zikai Song

arxiv: 2507.00029 · v2 · pith:3OQ3RTWGnew · submitted 2025-06-17 · 💻 cs.LG · cs.AI

Wenbing Li, Zikai Song, Hang Zhou, Yunyao Zhang, Junqing Yu, Wei Yang

Pith reviewed 2026-05-19 09:02 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords LoRA · mixture of experts · parameter-efficient fine-tuning · attention routing · multi-task adaptation · routing specialization loss · large language models · state-space models

The pith

LoRA-Mixer routes task-specific LoRA experts through attention input and output layers to deliver token-level specialization in LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LoRA-Mixer as a framework that places modular low-rank adaptation experts directly into the linear projection matrices of the attention module rather than replacing full layers or targeting feed-forward blocks. It coordinates these experts using serial attention routing combined with a Routing Specialization Loss that balances load and encourages selective, input-aware decisions. This setup is designed to remain compatible with both standard Transformers and state-space models while supporting either joint training or plug-and-play use of existing adapters. A sympathetic reader would care because the reported results show higher accuracy on diverse tasks with substantially fewer trainable parameters than prior mixture-of-experts LoRA approaches.

Core claim

LoRA-Mixer coordinates modular LoRA experts through serial attention routing by inserting them into the input and output linear layers of the attention module, employing an adaptive Routing Specialization Loss to enforce global balance and input-aware specialization via entropy shaping. The framework supports joint optimization with differentiable hard-soft top-k routing or plug-and-play routing over frozen pre-trained LoRAs, and across 15 benchmarks it outperforms state-of-the-art routing and LoRA-MoE baselines while using 48 percent of their trainable parameters, with gains of 3.79, 2.90, and 3.95 percentage points on GSM8K, CoLA, and ARC-C respectively.

What carries the argument

Serial attention routing of LoRA experts inserted into the attention module's input and output linear layers, which exploits the attention mechanism itself to achieve fine-grained token-level specialization.
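
The mechanism described above can be illustrated with a minimal sketch of one linear projection augmented by routed LoRA experts. It assumes a plain top-k softmax gate and computes `y = W x + scale * sum_e g_e * B_e (A_e x)` for a single token vector. The function names, the dimensions, and the gate itself are illustrative assumptions; the page does not specify the paper's hard-soft top-k scheme, so this should not be read as the authors' implementation.

```python
import math
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    m = max(z)
    e = [math.exp(t - m) for t in z]
    s = sum(e)
    return [t / s for t in e]

def top_k_gates(logits, k):
    """Keep the k largest logits, softmax over them, zero elsewhere."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in keep])
    gates = [0.0] * len(logits)
    for i, p in zip(keep, probs):
        gates[i] = p
    return gates

def routed_lora_projection(x, W, experts, router, k=2, scale=1.0):
    """Frozen projection W plus a gated sum of low-rank updates B_e A_e.

    Assumption: a plain linear router with top-k softmax gating,
    not the paper's hard-soft top-k routing.
    """
    gates = top_k_gates(matvec(router, x), k)
    y = matvec(W, x)
    for g, (A, B) in zip(gates, experts):
        if g == 0.0:
            continue  # expert not selected for this token
        delta = matvec(B, matvec(A, x))
        y = [yi + scale * g * di for yi, di in zip(y, delta)]
    return y, gates

# Hypothetical toy sizes, chosen only for illustration.
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

d_in, d_out, r, n_experts = 8, 8, 2, 4
W = rand_matrix(d_out, d_in)
experts = [(rand_matrix(r, d_in), rand_matrix(d_out, r)) for _ in range(n_experts)]
router = rand_matrix(n_experts, d_in)
x = [rng.gauss(0, 1) for _ in range(d_in)]
y, gates = routed_lora_projection(x, W, experts, router, k=2)
```

Because the gate is computed per token, different tokens in the same sequence can activate different experts, which is the token-level behaviour the summary refers to.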

If this is right

The approach is reported to outperform prior methods on 15 benchmarks, including MedQA, GSM8K, HumanEval, and GLUE, while using only 48 percent of the baselines' trainable parameters.
It supports both joint optimization of adapters and router as well as plug-and-play routing over frozen pre-trained LoRA modules.
Cross-model transfer and adapter reuse experiments show versatility and data efficiency.
The design remains drop-in compatible with Transformers and state-space models because it targets ubiquitous linear projection layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The attention-focused placement may extend to other architectures where linear projections dominate, such as certain vision or multimodal models.
Fewer parameters per task could allow scaling to larger numbers of tasks or experts under the same compute budget.
The serial routing idea might combine with other adaptation methods to further reduce interference between tasks.

Load-bearing premise

That inserting LoRA experts specifically into the input and output linear layers of the attention module rather than FFN blocks produces fine-grained token-level specialization while keeping the method compatible with Transformers and state-space models.

What would settle it

A controlled experiment that places the same LoRA experts into FFN blocks instead of attention layers and measures whether the reported performance gains and parameter reduction on GSM8K and ARC-C disappear.

Figures

Figures reproduced from arXiv: 2507.00029 by Hang Zhou, Junqing Yu, Wei Yang, Wenbing Li, Yunyao Zhang, Zikai Song.

**Figure 2.** The overall architecture of LoRA-Mixer. LoRA-Mixer is applied to the linear projection …

**Figure 4.** Expert Assignment Overview. As K increases from 1 to 3, we observe that the accuracy of both tasks improves, indicating that using multiple experts allows the model to obtain complementary information. However, further increasing the value of K does not guarantee better results, but may degrade the performance. Therefore, the setting of K is crucial for the MoE model. How to set or dynamically learn the mo…

**Figure 5.** Expert Load Distribution across Tasks.

**Figure 6.** Balance loss curve using RSL loss during training.

Original abstract

Recent attempts to combine low-rank adaptation (LoRA) with mixture-of-experts (MoE) for multi-task adaptation of Large Language Models (LLMs) often replace whole attention/FFN layers with switch experts or append parallel expert branches, undermining parameter efficiency and limiting task specialization. We introduce LoRA-Mixer, a modular MoE framework that routes task-specific LoRA experts into the core projection matrices of the attention module, namely input and output linear layers, rather than primarily targeting FFN blocks. The design delivers fine-grained token-level specialization by fully exploiting the attention mechanism, while remaining drop-in compatible with Transformers and state-space models (SSMs), since linear projection layers are ubiquitous. To train robust routers from limited data while promoting stable, selective decisions and high expert reuse, LoRA-Mixer employs an adaptive Routing Specialization Loss (RSL) that jointly enforces global load balance and input-aware specialization via an entropy-shaping objective. The framework supports two regimes: (i) joint optimization of adapters and router with a differentiable hard-soft top-k routing scheme, and (ii) plug-and-play routing over frozen, pre-trained LoRA modules sourced from public repositories. Across 15 benchmarks, including MedQA, GSM8K, HumanEval, and GLUE, RSL-optimized LoRA-Mixer outperforms state-of-the-art routing and LoRA-MoE baselines while using 48 percent of their trainable parameters, with gains of 3.79, 2.90, and 3.95 percentage points on GSM8K, CoLA, and ARC-C, respectively. Cross-model transfer and adapter reuse experiments further demonstrate the approach's versatility and data efficiency. Our code is available at https://github.com/hustcselwb/LoRA-Mixer.
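
To make the idea of a routing objective that combines global load balance with per-token selectivity concrete, here is a minimal sketch. It is not the paper's RSL, whose exact form and adaptive weighting are not given on this page. It assumes a balance term equal to the KL divergence between the mean routing distribution and the uniform distribution, plus a mean per-token entropy term; the weights `lam_balance` and `lam_entropy` are hypothetical.

```python
import math

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def routing_loss_sketch(probs, lam_balance=1.0, lam_entropy=0.1):
    """Illustrative routing loss over a batch of per-token routing distributions.

    probs: list of rows, one per token, each a distribution over experts.
    Assumed form (not the paper's RSL):
      balance   = log(E) - H(mean routing distribution)  # KL to uniform, >= 0
      sharpness = mean per-token entropy                  # low when tokens are selective
    """
    n_tokens = len(probs)
    n_experts = len(probs[0])
    mean_load = [sum(row[e] for row in probs) / n_tokens for e in range(n_experts)]
    balance = math.log(n_experts) - entropy(mean_load)
    sharpness = sum(entropy(row) for row in probs) / n_tokens
    return lam_balance * balance + lam_entropy * sharpness, balance, sharpness

# Three toy routing patterns over two experts.
collapsed = [[1.0, 0.0], [1.0, 0.0]]    # every token picks expert 0
specialised = [[1.0, 0.0], [0.0, 1.0]]  # selective per token, balanced overall
diffuse = [[0.5, 0.5], [0.5, 0.5]]      # balanced overall, no specialization

loss_collapsed, bal_collapsed, _ = routing_loss_sketch(collapsed)
loss_specialised, _, _ = routing_loss_sketch(specialised)
loss_diffuse, bal_diffuse, sharp_diffuse = routing_loss_sketch(diffuse)
```

Under these assumptions, only the pattern that is both balanced across the batch and sharp per token drives both terms to zero, which is the combination the abstract attributes to the entropy-shaping objective.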

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LoRA-Mixer routes LoRA experts into attention projections with a new RSL loss and reports benchmark gains at lower parameter cost, but the efficiency numbers need explicit counts to hold up.

The core idea is routing task-specific LoRA experts directly into the input and output linear layers of attention rather than FFN blocks, combined with an entropy-shaping Routing Specialization Loss that tries to balance load and encourage specialization during training. They also support a plug-and-play mode with frozen public adapters and a joint training regime with hard-soft top-k routing. This setup is meant to work as a drop-in for both transformers and state-space models since it only touches ubiquitous linear projections.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes LoRA-Mixer, a modular MoE framework that routes task-specific LoRA experts into the input and output linear layers of the attention module (rather than primarily FFN blocks) for fine-grained token-level specialization in Transformers and state-space models. It introduces an adaptive Routing Specialization Loss (RSL) that enforces global load balance and input-aware specialization via entropy shaping, supporting both joint optimization with differentiable hard-soft top-k routing and plug-and-play use of frozen pre-trained adapters. Across 15 benchmarks the RSL-optimized model is reported to outperform routing and LoRA-MoE baselines while using 48% of their trainable parameters, with concrete gains of 3.79, 2.90, and 3.95 percentage points on GSM8K, CoLA, and ARC-C respectively; cross-model transfer and adapter-reuse experiments are also presented.

Significance. If the empirical claims are substantiated, the work would offer a practical advance in parameter-efficient multi-task adaptation by showing that attention-layer LoRA placement plus a specialized routing objective can improve both performance and efficiency over prior LoRA-MoE designs while remaining compatible with existing model families. The dual support for joint training and reuse of public adapters, together with the explicit code release, would strengthen its utility for data-efficient and modular fine-tuning scenarios.

major comments (2)

[Abstract] The headline claim that LoRA-Mixer uses 48% of the trainable parameters of the compared baselines is load-bearing for the efficiency argument, yet no breakdown is supplied (expert count, LoRA rank r, number of routed layers, or exact baseline configurations). Without these counts it is impossible to determine whether the reported savings arise from the serial attention routing and RSL objective or simply from deploying fewer adapters overall.
[Abstract / results section] The reported gains (3.79 pp on GSM8K, 2.90 pp on CoLA, 3.95 pp on ARC-C) and the overall outperformance across 15 benchmarks are presented without statistical significance, standard deviations across runs, or detailed descriptions of baseline implementations and data splits. These omissions undermine the robustness of the central performance claim.
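
The first major comment turns on how a trainable-parameter count is assembled. A minimal sketch of that arithmetic follows, assuming each LoRA expert on a `d_in × d_out` projection adds `rank * (d_in + d_out)` parameters and each routed projection carries its own bias-free linear router. The per-projection router, the 4096 hidden size, the 64 routed projections, and the expert counts are all hypothetical placeholders, not values from the paper.

```python
def lora_params(d_in, d_out, rank):
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

def routed_lora_budget(layer_shapes, n_experts, rank, router=True):
    """Total trainable parameters for routed LoRA experts.

    layer_shapes: list of (d_in, d_out), one entry per routed projection.
    Assumption: one linear router (d_in x n_experts, no bias) per projection.
    """
    total = 0
    for d_in, d_out in layer_shapes:
        total += n_experts * lora_params(d_in, d_out, rank)
        if router:
            total += d_in * n_experts
    return total

# Hypothetical configuration: 32 blocks x 2 square projections of width 4096.
shapes = [(4096, 4096)] * 64
budget = routed_lora_budget(shapes, n_experts=4, rank=16)
budget_8_experts = routed_lora_budget(shapes, n_experts=8, rank=16)
ratio = budget / budget_8_experts
```

In this toy setup, halving the expert count alone halves the budget, which is why a bare percentage cannot separate an architectural saving from a smaller configuration without the breakdown the referee requests.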

minor comments (2)

[Introduction] The two training regimes (joint optimization versus plug-and-play routing) are introduced in the abstract but would benefit from an explicit side-by-side comparison table or diagram early in the manuscript to clarify their respective hyper-parameter settings and data requirements.
[Method] Notation for the RSL objective and the hard-soft top-k routing scheme should be introduced with a single consolidated equation block rather than scattered references, to improve readability for readers unfamiliar with entropy-shaping losses.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

Referee: [Abstract] The headline claim that LoRA-Mixer uses 48% of the trainable parameters of the compared baselines is load-bearing for the efficiency argument, yet no breakdown is supplied (expert count, LoRA rank r, number of routed layers, or exact baseline configurations). Without these counts it is impossible to determine whether the reported savings arise from the serial attention routing and RSL objective or simply from deploying fewer adapters overall.

Authors: We agree that an explicit parameter breakdown is required to support the efficiency claim. The 48% figure is obtained by comparing our configuration (serial routing of 4-8 LoRA experts with rank r=16 into attention input/output projections across selected layers) against the baselines' higher expert counts and ranks applied primarily to FFN blocks. In the revised manuscript we will add a dedicated table in Section 4.1 (Experimental Setup) that lists expert counts, LoRA ranks, number of routed layers, total trainable parameters for LoRA-Mixer and each baseline, and the precise hyper-parameter settings used for the baselines. This addition will make clear that the reported savings derive from the attention-layer placement and serial routing rather than from simply using fewer adapters overall.

revision: yes

Referee: [Abstract / results section] The reported gains (3.79 pp on GSM8K, 2.90 pp on CoLA, 3.95 pp on ARC-C) and the overall outperformance across 15 benchmarks are presented without statistical significance, standard deviations across runs, or detailed descriptions of baseline implementations and data splits. These omissions undermine the robustness of the central performance claim.

Authors: We acknowledge that the current presentation lacks explicit statistical support. Although the full experimental section reports results averaged over multiple random seeds for the primary benchmarks, we did not include per-run standard deviations, p-values, or exhaustive baseline implementation details in the abstract or main tables. In the revision we will (i) add standard deviations and 95% confidence intervals to all reported metrics in Tables 2-4, (ii) include paired t-test p-values for the highlighted gains on GSM8K, CoLA, and ARC-C, and (iii) expand the appendix with complete baseline hyper-parameters, data-split specifications, and training protocols. These changes will be reflected in both the abstract and the results section.

revision: yes
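
The statistics promised in the second response can be sketched with the standard library alone. The example below computes the mean paired difference, its standard deviation, a 95% confidence interval, and the paired t statistic from per-seed scores. The scores are made-up placeholders, not results from the paper, and the critical value 2.776 is the two-sided 95% value of Student's t for 4 degrees of freedom, so it only applies to five paired runs.

```python
import math
import statistics

def paired_summary(ours, baseline, t_crit):
    """Summarise paired per-seed scores (same seeds and splits for both methods).

    t_crit: two-sided critical value of Student's t for len(ours) - 1
            degrees of freedom, supplied by the caller.
    """
    diffs = [a - b for a, b in zip(ours, baseline)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    std_diff = statistics.stdev(diffs)  # sample standard deviation
    std_err = std_diff / math.sqrt(n)
    return {
        "n": n,
        "dof": n - 1,
        "mean_diff": mean_diff,
        "std_diff": std_diff,
        "t_stat": mean_diff / std_err,
        "ci95": (mean_diff - t_crit * std_err, mean_diff + t_crit * std_err),
    }

# Hypothetical per-seed accuracies (percent); placeholders for illustration only.
ours = [71.0, 72.0, 70.0, 73.0, 71.5]
baseline = [68.0, 69.5, 67.0, 69.0, 68.5]
summary = paired_summary(ours, baseline, t_crit=2.776)
```

A p-value additionally requires the t-distribution CDF, which the standard library does not provide; a statistics package such as SciPy (`scipy.stats.ttest_rel`) would be one way to obtain it.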

Circularity Check

0 steps flagged

No circularity: empirical method with benchmark validation

The paper introduces LoRA-Mixer as an architectural design placing LoRA experts in attention input/output projections, paired with an RSL training objective for routing balance and specialization. All performance claims (outperformance on 15 benchmarks, 48% parameter usage) are presented as empirical outcomes from experiments rather than any first-principles derivation, prediction, or uniqueness theorem. No equations reduce a claimed result to a fitted input by construction, and no self-citation chain bears the central load; the RSL loss is defined and then validated through results. The derivation chain is therefore self-contained as a proposal plus empirical evidence.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework rests on standard assumptions about attention mechanisms and introduces a new loss function and routing scheme whose effectiveness is shown empirically rather than derived from first principles.

free parameters (1)

top-k routing hyperparameters
Parameters controlling the differentiable hard-soft top-k routing scheme are chosen or tuned during training.

axioms (1)

domain assumption · Linear projection layers are ubiquitous in Transformers and state-space models
Invoked to support the claim of drop-in compatibility.

invented entities (1)

Routing Specialization Loss (RSL) · no independent evidence
purpose: Jointly enforces global load balance and input-aware specialization via entropy shaping
New objective function introduced to train the router from limited data.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

IntervenSim: Intervention-Aware Social Network Simulation for Opinion Dynamics
cs.SI 2026-04 unverdicted novelty 7.0

IntervenSim is an intervention-aware social network simulation that couples source interventions with crowd interactions in a feedback loop, improving MAPE by 41.6% and DTW by 66.9% over prior static frameworks on rea...

GateMOT: Q-Gated Attention for Dense Object Tracking
cs.CV 2026-04 unverdicted novelty 6.0

GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object tracking on benchmarks like BEE24.

OmniTrend: Content-Context Modeling for Scalable Social Popularity Prediction
cs.CV 2026-04 unverdicted novelty 6.0

OmniTrend predicts popularity by combining separate content attractiveness and contextual exposure predictors using cross-modal and exogenous signals.

HotComment: A Benchmark for Evaluating Popularity of Online Comments
cs.AI 2026-04 unverdicted novelty 6.0

HotComment is a new multimodal benchmark that quantifies online comment popularity via content quality assessment, interaction-based prediction, and agent-simulated user engagement, accompanied by the StyleCmt stylist...

Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction
cs.MM 2026-04 unverdicted novelty 5.0

A new joint spatio-temporal enlargement model for micro-video popularity prediction using frame scoring for long sequences and a topology-aware memory bank for unbounded historical associations.

CurEvo: Curriculum-Guided Self-Evolution for Video Understanding
cs.CV 2026-04 unverdicted novelty 4.0

CurEvo integrates curriculum guidance into self-evolution to structure autonomous improvement of video understanding models, yielding gains on VideoQA benchmarks.


Jakub Krajewski, Jan Ludziejewski, Kamil Adamczewski, Maciej Pióro, Michał Krutul, Szymon Antoniak, Kamil Ciebiera, Krystian Król, Tomasz Odrzygó´ zd´ z, Piotr Sankowski, et al. Scaling laws for fine-grained mixture of experts.arXiv preprint arXiv:2402.07871, 2024

work page arXiv 2024
[53]

Learning a mixture of granularity- specific experts for fine-grained categorization

Lianbo Zhang, Shaoli Huang, Wei Liu, and Dacheng Tao. Learning a mixture of granularity- specific experts for fine-grained categorization. InProceedings of the IEEE/CVF international conference on computer vision, pages 8331–8340, 2019

work page 2019
[54]

Sparse mixture-of-experts are domain generalizable learners.arXiv preprint arXiv:2206.04046, 2022

Bo Li, Yifei Shen, Jingkang Yang, Yezhen Wang, Jiawei Ren, Tong Che, Jun Zhang, and Ziwei Liu. Sparse mixture-of-experts are domain generalizable learners.arXiv preprint arXiv:2206.04046, 2022

work page arXiv 2022
[55]

Hard mixtures of experts for large scale weakly supervised vision

Sam Gross, Marc’Aurelio Ranzato, and Arthur Szlam. Hard mixtures of experts for large scale weakly supervised vision. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6865–6873, 2017

work page 2017
[56]

Universal language model fine-tuning for text classi- fication

Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classi- fication. InProceedings of the 56th Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), Melbourne, Australia, July 2018. Association for Compu- tational Linguistics

work page 2018
[57]

Adapterfusion: Non-destructive task composition for transfer learning

Neil Houlsby, Andrei Giurgiu, Stanisław Jastrzebski, Bruna Morrone, Quentin De Vries, Jack W Rae, Stephen King, and Sebastian Ruder. Adapterfusion: Non-destructive task composition for transfer learning. InAdvances in Neural Information Processing Systems, volume 32, pages 6649–6659, 2019

work page 2019
[58]

MAD-X: An adapter-based framework for multi-task cross-lingual transfer

Jonas Pfeiffer, Andreas Rücklé, Christian Poth, Aishwarya Anil, Ivan Texier, Sebastian Michael, and Iryna Gurevych. MAD-X: An adapter-based framework for multi-task cross-lingual transfer. InInternational Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 7430–7439. PMLR, 2020

work page 2020
[59]

Parameter-efficient transfer learning with transformers

Neil Houlsby, Andrei Giurgiu, Stanisław Jastrzebski, Bruna Morrone, Quentin De Vries, Andrea Waldon, and Stephen King. Parameter-efficient transfer learning with transformers. In International Conference on Machine Learning, volume 97 ofProceedings of Machine Learning Research, pages 2791–2800. PMLR, 2019. 13

work page 2019
[60]

Coupled mamba: Enhanced multi-modal fusion with coupled state space model

Wenbing Li, Hang Zhou, Junqing Yu, Zikai Song, and Wei Yang. Coupled mamba: Enhanced multi-modal fusion with coupled state space model.arXiv preprint arXiv:2405.18014, 2024

work page arXiv 2024
[61]

intelligent

Dmitry Lepikhin, Hyoukjun Mehdad, Mostafa Shen, Tao Xu, Yanping Chen, Dmitry Krikun, and Minh-Thang Luong. Gshard: Scaling giant models with conditional computation and automatic sharding. InInternational Conference on Learning Representations, 2021. 14 A Experiment Result Table 9: Comparison of LoRA-Mixer on Falcon-Mamba, Mistral, and LLaMA across seven ...

work page arXiv 2021
[62]

25 Guidelines: • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects

Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...

work page 2025
