HELLoRA: Hot Experts Layer-Level Low-Rank Adaptation for Mixture-of-Experts Models

Jia Wei; Longxiang Wang; Ping Chen; Qianyang li; Shaoxun Wang; Yancheng Pan; Zhonghao Zhang; Ziyi Qiu

arxiv: 2605.18795 · v1 · pith:34EWLCNOnew · submitted 2026-05-11 · 💻 cs.LG · cs.AI

Jia Wei, Zhonghao Zhang, Ping Chen, Qianyang li, Yancheng Pan, Shaoxun Wang, Ziyi Qiu, Longxiang Wang

Pith reviewed 2026-05-20 23:24 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords HELLoRA · Mixture-of-Experts · Low-Rank Adaptation · Parameter-Efficient Fine-Tuning · Expert Activation Frequency · Structured Regularization · MoE Models · Adapter Placement

The pith

HELLoRA attaches LoRA only to the most frequently activated experts per layer in MoE models, cutting trainable parameters to 15.7% of vanilla LoRA while raising accuracy by 9.2%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HELLoRA to make parameter-efficient fine-tuning work better for Mixture-of-Experts models by placing low-rank adapters exclusively on the experts that fire most often at each layer. This selective placement shrinks the number of trainable parameters and the extra computation they cause, yet downstream accuracy rises on mathematical reasoning, code generation, and safety tasks. The authors link the gains to a form of structured regularization that keeps the experts' original specializations intact instead of overwriting them uniformly. They also show the idea combines with another low-rank method to reach even tighter parameter budgets. The approach is tested on three different MoE backbones and delivers consistent wins over standard LoRA and other PEFT baselines.

Core claim

HELLoRA attaches LoRA modules only to the most frequently activated experts at each layer of a Mixture-of-Experts model. This activation-aware placement reduces trainable parameters and adapter-induced FLOPs while improving accuracy, which the authors attribute to structured regularization that preserves pretrained expert specialization. When further composed with LoRI into HELLoRI, the method remains effective under extreme parameter constraints.

What carries the argument

Hot-Experts Layer-level Low-Rank Adaptation (HELLoRA): the mechanism that ranks experts by activation frequency during fine-tuning and routes LoRA modules only to the highest-ranked ones per layer.
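The ranking step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `(layer, expert)` routing-log format, the function names, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def count_activations(routing_log: Iterable[Tuple[int, int]]) -> Dict[int, Counter]:
    """Tally how often each expert is selected at each layer.

    routing_log yields (layer_index, expert_index) pairs, one per routed
    token slot. This log format is hypothetical.
    """
    counts: Dict[int, Counter] = {}
    for layer, expert in routing_log:
        counts.setdefault(layer, Counter())[expert] += 1
    return counts


def select_hot_experts(counts: Dict[int, Counter], k: int) -> Dict[int, List[int]]:
    """Return the k most frequently activated experts for every layer."""
    hot: Dict[int, List[int]] = {}
    for layer, counter in counts.items():
        # Descending count; ties broken by expert index so the result is deterministic.
        ranked = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
        hot[layer] = sorted(expert for expert, _ in ranked[:k])
    return hot


# Toy routing log (made up): two layers, four experts.
log = [(0, 2), (0, 2), (0, 1), (0, 3), (0, 2), (0, 1),
       (1, 0), (1, 0), (1, 3), (1, 3), (1, 3), (1, 1)]
hot = select_hot_experts(count_activations(log), k=2)
print(hot)  # {0: [1, 2], 1: [0, 3]}
```

Only the experts listed in `hot[layer]` would then receive LoRA modules; all other experts in that layer stay frozen.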

If this is right

On OlMoE, HELLoRA requires 15.7% of vanilla LoRA's trainable parameters, cuts adapter FLOPs by 38.7%, delivers 1.9x training throughput, and raises accuracy by 9.2%.
On DeepSeekMoE the same pattern holds with only 23.2% of LoRA's trainable parameters.
The method improves results across mathematical reasoning, code generation, and safety-alignment task families.
Composing HELLoRA with LoRI to form HELLoRI extends the gains to even smaller parameter budgets.
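A rough sanity check on the parameter figures, under a deliberately simple model that is an assumption of this sketch rather than the paper's accounting: each layer carries some shared adapter budget (attention and gating, which the paper adapts in both settings) plus an equal budget per adapted expert. With no shared adapters the ratio would be exactly hot experts over total experts, so the reported 15.7% and 23.2% should sit above 8/64 and 12/64 respectively, which they do.

```python
def adapter_param_fraction(shared: float, per_expert: float,
                           num_experts: int, num_hot: int) -> float:
    """Ratio of trainable adapter parameters: hot-expert placement vs. all experts.

    `shared` is the adapter budget on non-expert modules (attention, gating),
    `per_expert` the adapter budget of one expert. Both are hypothetical units.
    """
    full = shared + num_experts * per_expert
    sparse = shared + num_hot * per_expert
    return sparse / full


# Lower bound with no shared adapters: num_hot / num_experts.
print(adapter_param_fraction(0.0, 1.0, 64, 8))    # 0.125
print(adapter_param_fraction(0.0, 1.0, 64, 12))   # 0.1875

# Any shared adapter budget pushes the ratio above that bound.
print(adapter_param_fraction(2.0, 1.0, 64, 8) > 0.125)  # True
```

The model ignores architecture details such as DeepSeekMoE's shared experts, so it only bounds the ratio and does not reproduce the reported numbers.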

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Activation-frequency statistics collected during fine-tuning could serve as a lightweight prior for deciding which modules to adapt in other sparsely activated architectures.
If expert activation patterns shift markedly between pretraining and fine-tuning domains, the hot-expert ranking might need periodic recomputation to retain its advantage.
The same layer-wise frequency filter could be applied to other parameter-efficient methods such as prompt tuning or adapter layers beyond LoRA.

Load-bearing premise

That selecting adapters according to expert activation frequency supplies structured regularization that preserves the pretrained specialization of individual experts.

What would settle it

A controlled experiment in which the same number of experts per layer receive LoRA but are chosen uniformly at random instead of by activation frequency; if accuracy matches or exceeds the frequency-based version, the claimed benefit of hot-expert selection is falsified.
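One way to build the matched control is sketched below. It assumes the hot-expert sets are already available as a mapping from layer index to expert indices; the function name and data layout are hypothetical.

```python
import random
from typing import Dict, List


def random_control(hot: Dict[int, List[int]], num_experts: int,
                   seed: int = 0) -> Dict[int, List[int]]:
    """For each layer, draw as many experts uniformly at random
    (without replacement) as the hot set for that layer contains."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    return {
        layer: sorted(rng.sample(range(num_experts), len(experts)))
        for layer, experts in hot.items()
    }


# Toy hot sets (made up): two layers, three hot experts each, eight experts total.
hot = {0: [1, 2, 5], 1: [0, 3, 7]}
control = random_control(hot, num_experts=8, seed=0)

# The control adapts the same number of experts per layer as the hot selection,
# so any accuracy gap can be attributed to the selection rule, not the budget.
print({layer: len(experts) for layer, experts in control.items()})  # {0: 3, 1: 3}
```

Running the comparison over several seeds would also give a spread against which the frequency-based result can be judged.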

Figures

Figures reproduced from arXiv: 2605.18795 by Jia Wei, Longxiang Wang, Ping Chen, Qianyang li, Shaoxun Wang, Yancheng Pan, Zhonghao Zhang, Ziyi Qiu.

**Figure 1.** Expert activation patterns in OlMoE. (a) Across all layers, expert usage appears balanced (orange); within Layer 7 (green), the top-8 experts account for >50% of activations. (b) At the same layer, different tasks activate different expert subsets, confirming that hot experts are both layer-specific and task-specific.

**Figure 2.** Comparison of LoRA and HELLoRA. Standard LoRA attaches adapters to all experts in every MoE layer (top). HELLoRA attaches adapters only to hot experts identified by the warm-up profiling stage, keeping cold experts frozen (bottom). Attention and gating components receive LoRA adapters in both settings.

**Figure 3.** Warm-up stability analysis. (a) Jaccard overlap between hot experts identified using 10% of the data and those identified using the full dataset. 4 out of 16 layers match exactly, 8 layers differ by one expert, and 4 layers differ by two experts. (b) Coverage of full-data hot experts and wall-clock overhead as the warm-up fraction increases. Using 10% of the data, our default setting, recovers 87.5% of the…
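The overlap statistics in the Figure 3 caption can be reproduced with a short worked example. The sketch assumes top-8 hot experts per layer (the OlMoE setting shown in the heatmaps) and reads "differ by one expert" as one expert swapped out; both readings are assumptions, and the expert indices are made up.

```python
from typing import Iterable


def jaccard(a: Iterable[int], b: Iterable[int]) -> float:
    """Jaccard overlap |A ∩ B| / |A ∪ B| between two expert sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def coverage(warmup: Iterable[int], full: Iterable[int]) -> float:
    """Fraction of the full-data hot experts that the warm-up set recovers."""
    sf = set(full)
    return len(set(warmup) & sf) / len(sf)


full = list(range(8))                    # hypothetical top-8 from the full dataset
exact = list(range(8))                   # warm-up set that matches exactly
off_by_one = list(range(7)) + [20]       # one expert swapped
off_by_two = list(range(6)) + [20, 21]   # two experts swapped

print(jaccard(exact, full))        # 1.0
print(jaccard(off_by_one, full))   # 7/9
print(jaccard(off_by_two, full))   # 0.6
print(coverage(off_by_two, full))  # 0.75

# Aggregate over the caption's layer counts: 4 exact, 8 off by one, 4 off by two.
recovered = 4 * 8 + 8 * 7 + 4 * 6
total = 16 * 8
print(recovered / total)  # 0.875
```

Under these assumptions the aggregate coverage comes out at 87.5%, the same figure the caption quotes for the 10% warm-up setting.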

**Figure 4.** Training throughput on OlMoE (GSM8K). HELLoRA and its variants achieve approximately 1.9× the throughput of LoRA by removing adapter kernels from cold experts entirely. Unlike masking-based methods (LoRI), which retain all adapter parameters in the computational graph, HELLoRA reduces actual FLOPs and memory traffic, yielding a genuine wall-clock speedup.

**Figure 5.** Expert activation heatmaps on OlMoE-1B-7B. Each panel shows one task. Rows correspond to MoE layers and columns to expert indices (0 to 63). Color intensity indicates activation frequency. Green boxes mark the top-8 hot experts per layer. Hot experts vary across both layers and tasks.

**Figure 6.** Expert activation heatmap for Mixtral-8×7B on GSM8K. Each cell shows the activation ratio of one expert at one layer. Green boxes mark the top-2 hot experts per layer. Despite having only 8 experts with top-2 routing, activation remains highly skewed across all 32 layers, with the two most active experts typically accounting for over 50% of all token assignments.

**Figure 7.** Expert activation heatmap for DeepSeekMoE on GSM8K. Rows correspond to 27 MoE layers and columns to 64 routed expert indices. Green boxes mark the top-12 hot experts per layer. The top-12 experts account for 51.9% of activations on average, compared with 18.8% under uniform allocation. The identity of hot experts shifts across layers, confirming that activation skew is a general property of MoE architectur…
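The skew quoted in these captions is a top-k share of activations compared against a uniform baseline of k divided by the number of experts. A small sketch, with made-up counts for the toy layer and function names chosen for illustration:

```python
from typing import Sequence


def top_k_share(counts: Sequence[int], k: int) -> float:
    """Fraction of all activations captured by the k most active experts."""
    return sum(sorted(counts, reverse=True)[:k]) / sum(counts)


def uniform_share(num_experts: int, k: int) -> float:
    """Share that k experts would receive if routing were perfectly uniform."""
    return k / num_experts


# Uniform baseline for top-12 of 64 routed experts; rounds to the caption's 18.8%.
print(uniform_share(64, 12))  # 0.1875

# Ratio of the reported 51.9% share to that baseline.
print(round(0.519 / uniform_share(64, 12), 2))  # 2.77

# Hypothetical skewed layer with six experts.
counts = [40, 30, 10, 10, 5, 5]
print(top_k_share(counts, 2))  # 0.7
```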

Original abstract

Low-Rank Adaptation (LoRA) dominates parameter-efficient fine-tuning of large language models, yet most variants target dense architectures. Mixture-of-Experts (MoE) models scale parameters at near-constant per-token compute, and their sparse activation patterns create untapped opportunities for more efficient adaptation. We propose Hot-Experts Layer-level Low-Rank Adaptation (HELLoRA), which attaches LoRA modules only to the most frequently activated experts at each layer. This simple mechanism reduces trainable parameters and adapter-induced FLOPs while improving downstream performance, an effect we attribute to a form of structured regularization that preserves pretrained expert specialization. To stress-test HELLoRA under extreme parameter budgets, we further compose it with LoRI to form HELLoRI, which freezes the up-projection and sparsifies the down-projection. Across three MoE backbones, namely OlMoE-1B-7B, Mixtral-8x7B, and DeepSeekMoE, and three task families covering mathematical reasoning, code generation, and safety alignment, HELLoRA consistently outperforms strong PEFT baselines. Relative to vanilla LoRA on OlMoE, HELLoRA uses 15.7% of the trainable parameters, reduces adapter FLOPs by 38.7%, achieves 1.9x the training throughput, and improves accuracy by 9.2%. On DeepSeekMoE, HELLoRA outperforms LoRA while using only 23.2% of its trainable parameters. These results demonstrate that activation-aware adapter placement is an effective and practical route to scaling PEFT for MoE language models.
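The HELLoRI composition relies on a LoRI-style update, in which one low-rank factor is frozen and the other is sparsified by a binary mask, commonly written as h = xW + xA(B ⊙ M). The sketch below implements that forward pass in plain Python for a single input row. Which factor the abstract calls "up" or "down", the magnitude-based mask rule, and the `fraction` parameter are assumptions of this example, not details confirmed by the text above.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x m) and b (m x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def top_fraction_mask(b: Matrix, fraction: float) -> Matrix:
    """Binary mask keeping the largest-magnitude entries of b.

    Keeps at least one entry; ties at the threshold are all kept.
    """
    flat = sorted((abs(v) for row in b for v in row), reverse=True)
    keep = max(1, int(len(flat) * fraction))
    threshold = flat[keep - 1]
    return [[1.0 if abs(v) >= threshold else 0.0 for v in row] for row in b]


def masked_lora_forward(x: Matrix, w: Matrix, a: Matrix, b: Matrix, mask: Matrix) -> Matrix:
    """Compute h = xW + xA(B ⊙ M); A is treated as frozen, only masked B entries act."""
    b_masked = [[v * m for v, m in zip(rb, rm)] for rb, rm in zip(b, mask)]
    base = matmul(x, w)
    delta = matmul(matmul(x, a), b_masked)
    return [[p + q for p, q in zip(rp, rq)] for rp, rq in zip(base, delta)]


# Toy shapes: d_in = d_out = 2, rank 1. All values are made up.
x = [[1.0, 2.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0], [0.0]]
b = [[0.5, -3.0]]

mask = top_fraction_mask(b, fraction=0.5)
h = masked_lora_forward(x, w, a, b, mask)
print(mask)  # [[0.0, 1.0]]
print(h)     # [[1.0, -1.0]]
```

Note that the mask zeroes entries but the full `b` is still stored and multiplied, which is the sense in which masking alone does not remove adapter computation.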

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces HELLoRA, which applies LoRA adapters only to the most frequently activated ('hot') experts per layer in MoE models, along with HELLoRI as a composition with LoRI. It reports that this yields better downstream performance than strong PEFT baselines across OlMoE-1B-7B, Mixtral-8x7B, and DeepSeekMoE on mathematical reasoning, code generation, and safety alignment tasks, while using far fewer trainable parameters (e.g., 15.7% of vanilla LoRA on OlMoE) and achieving efficiency gains such as 38.7% fewer adapter FLOPs, 1.9x training throughput, and +9.2% accuracy.

Significance. If the reported accuracy improvements hold after controlling for parameter count and selection criteria, the work would offer a practical, activation-aware approach to scaling PEFT for MoE architectures that dominate large-scale deployment. The multi-backbone, multi-task empirical evaluation provides a useful data point for the community studying sparse adaptation.

major comments (1)

[Abstract] The central performance claim attributes the 9.2% accuracy lift (and similar gains on other models) to 'a form of structured regularization that preserves pretrained expert specialization.' No isolating ablations are described that hold the number of adapters fixed while varying the selection rule (activation frequency vs. random selection or bottom-k). Without such controls or direct measurements (e.g., pre/post fine-tuning routing entropy or task-specific expert overlap), the attribution remains unverified and the efficiency numbers cannot be cleanly separated from the accuracy result.

minor comments (2)

[Abstract] Abstract and methods: The precise procedure for identifying 'hot' experts (threshold, count per layer, calibration set vs. online, static vs. dynamic) is not specified, limiting reproducibility.
Experimental section: Details on baseline hyperparameter tuning, statistical significance tests for accuracy deltas, and exact FLOPs measurement methodology are absent from the provided summary, which weakens verification of the efficiency claims.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our work. The point about strengthening the causal attribution of performance gains through additional controls is well-taken, and we outline revisions below to address it directly.

Referee: [Abstract] The central performance claim attributes the 9.2% accuracy lift (and similar gains on other models) to 'a form of structured regularization that preserves pretrained expert specialization.' No isolating ablations are described that hold the number of adapters fixed while varying the selection rule (activation frequency vs. random selection or bottom-k). Without such controls or direct measurements (e.g., pre/post fine-tuning routing entropy or task-specific expert overlap), the attribution remains unverified and the efficiency numbers cannot be cleanly separated from the accuracy result.

Authors: We appreciate this observation. The current manuscript demonstrates that HELLoRA outperforms vanilla LoRA and other PEFT baselines while using substantially fewer parameters across multiple MoE backbones and task families. However, we agree that the manuscript would benefit from explicit isolating ablations that fix the adapter count and vary only the selection rule. In the revised version we will add experiments comparing hot-expert selection against random selection and bottom-k selection of the same number of experts per layer. We will also report pre- and post-fine-tuning routing entropy as well as task-specific expert overlap statistics to provide direct support for the structured-regularization interpretation. These additions will help separate the contribution of activation-aware placement from the efficiency metrics, which are computed directly from the reduced adapter count and are therefore independent of the accuracy results.

revision: yes
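Routing entropy, one of the measurements discussed in this exchange, can be computed per layer from expert activation counts. The sketch below uses Shannon entropy in nats over the empirical activation distribution. That choice of definition is an assumption, since the text does not specify one, and the counts are made up.

```python
import math
from typing import Sequence


def routing_entropy(counts: Sequence[int]) -> float:
    """Shannon entropy (nats) of the expert-activation distribution at one layer."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]  # skip unused experts: 0 * log 0 := 0
    return -sum(p * math.log(p) for p in probs)


uniform = [25, 25, 25, 25]   # perfectly balanced routing over four experts
skewed = [97, 1, 1, 1]       # almost all tokens routed to one expert

print(math.isclose(routing_entropy(uniform), math.log(4)))   # True
print(routing_entropy(skewed) < routing_entropy(uniform))    # True
```

Comparing this quantity before and after fine-tuning, layer by layer, would indicate whether adaptation flattened or sharpened the router's preferences.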

Circularity Check

0 steps flagged

No significant circularity in empirical claims or derivations

The paper presents HELLoRA as a practical, activation-aware adapter placement method evaluated empirically across three MoE models and task families. All reported gains (parameter reduction to 15.7%, FLOPs savings, throughput, accuracy lifts) are measured outcomes from fine-tuning experiments compared against vanilla LoRA and other PEFT baselines. No equations, first-principles derivations, or fitted parameters are defined in terms of the target predictions. The interpretive attribution to 'structured regularization preserving expert specialization' is post-hoc explanation, not a load-bearing logical step that reduces to self-definition or self-citation by construction. The work is self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on a domain assumption about expert specialization and likely introduces at least one selection hyperparameter for identifying hot experts.

free parameters (1)

hot expert selection threshold or count per layer
The criterion for choosing which experts receive adapters must be specified and is likely tuned to achieve the reported gains.

axioms (1)

Domain assumption: frequent activation during adaptation indicates experts whose specialization should be preserved via selective adapter placement.
The abstract explicitly attributes performance improvements to this form of structured regularization.

pith-pipeline@v0.9.0 · 5848 in / 1237 out tokens · 38614 ms · 2026-05-20T23:24:31.221066+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "Across three MoE backbones... HELLoRA uses 15.7% of the trainable parameters... reduces adapter FLOPs by 38.7%"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 6 internal anchors

[1] F. Bianchi, M. Suzgun, G. Attanasio, P. Röttger, D. Jurafsky, T. Hashimoto, and J. Zou. Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions. arXiv preprint arXiv:2309.07875.

[2] M. Chen, J. Tworek, H. Jun, Q. Yuan, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

[3] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

[4] A. Q. Jiang, A. Sablayrolles, A. Roux, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088.

[5] A. Liu, B. Feng, B. Xue, et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.

[6] A. Panda, B. Isik, X. Qi, S. Koyejo, T. Weissman, and P. Mittal. Lottery ticket adaptation: Mitigating destructive interference in LLMs. In ICML 2024 Next Generation of AI Safety Workshop.
X. Qi, Y. Zeng, T. Xie, P.-Y. Chen, R. Jia, P. Mittal, and P. Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In ICLR, 2024.

[7] L. Yun, Y. Zhuang, Y. Fu, E. P. Xing, and H. Zhang. Toward inference-optimal mixture-of-expert large language models. arXiv preprint arXiv:2404.02852.

[8] L. Zhang, L. Zhang, S. Shi, X. Chu, and B. Li. LoRA-FA: Memory-efficient low-rank adaptation for large language models fine-tuning. arXiv preprint arXiv:2308.03303.

[9] J. Zhang, J. You, A. Panda, and T. Goldstein. LoRI: Reducing cross-task interference in multi-task low-rank adaptation. arXiv preprint arXiv:2504.07448.

[10] Y. Zhao, A. Gu, R. Varma, L. Luo, et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277.

[11] A Hyperparameters. Appendix A reports the hyperparameters used in all experiments. We use the same LoRA rank and scaling factor for LoRA, DoRA, LoRI, HELLoRA, and HELLoRI variants unless otherwise specified, so that the comparison focuses on adapter placement rather than rank tuning. For full fine-tuning, we use a smaller learning rate and a larger batch s...

[12] "Pure" attaches adapters only to expert FFNs,

Two consistent phenomena emerge. First, within each layer, activation is highly skewed toward a small number of hot experts. Second, the identity of hot experts varies across tasks, confirming that expert importance is both layer-specific and task-specific. For Mixtral-8×7B (Figure 6), we report activation ratios on GSM8K across all 32 MoE layers. Despite...

[13]

| Target | GSM8K B | GSM8K A | HumanEval B | HumanEval A | SAFE B | SAFE A |
|---|---|---|---|---|---|---|
| GSM8K | – | – | 13.41 | 13.78 | 66.75 | 58.43 |
| HumanEval | 1.66 | 5.86 | – | – | 66.75 | 63.01 |
| SAFE | 1.66 | 2.83 | 13.41 | 15.18 | – | – |

Two patterns emerge from the table. First, on non-target tasks, HELLoRA fine-tuning generally preserves or even improves performance. For example, fine-tuning on GSM8K slightly increases HumanEval fro...
