LiME: Lightweight Mixture of Experts for Efficient Multimodal Multi-task Learning

Chen Chen; Haris Mansoor; Md Kowsher; Nusrat Jahan Prottasha; Ozlem Garibay; Victor Zhu; Zhengping Ji

arXiv: 2604.02338 · v1 · submitted 2026-02-01 · cs.LG · cs.CL · cs.CV

Pith reviewed 2026-05-16 08:24 UTC · model grok-4.3

classification: cs.LG, cs.CL, cs.CV

keywords: mixture of experts, parameter-efficient fine-tuning, multimodal multi-task learning, lightweight modulation, zero-parameter routing, MMT-47 benchmark, expert specialization

The pith

LiME achieves expert specialization by modulating a single shared PEFT module with lightweight expert vectors instead of replicating adapters per expert.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

MoE-PEFT methods typically attach separate adapters to each expert, so trainable parameters grow linearly with expert count and the approach stays limited to adapter-based fine-tuning. LiME replaces replication with lightweight modulation: a shared PEFT module processes inputs while small expert vectors adjust its outputs. The design removes learned router parameters by deriving routing signals directly from frozen and adapted representations, and it adds n-gram windowed routing plus adaptive Auto Top-K selection. Theoretical results establish that more experts retain additional task-relevant information and that the modulation approximates full per-expert PEFT within a bounded error. On the MMT-47 benchmark of 47 text, image, and video tasks, the method matches or exceeds standard MoE-PEFT baselines while using up to four times fewer trainable parameters and training up to 29 percent faster.

Core claim

LiME achieves expert specialization in multimodal multi-task learning by using lightweight modulation of a single shared PEFT module via expert vectors rather than per-expert adapter replication. It employs zero-parameter routing that reuses existing frozen and adapted representations, together with n-gram windowed routing and adaptive Auto Top-K expert selection. The paper proves that increasing the number of experts preserves more task-relevant information and that modulation approximates full expert-specific PEFT with bounded error. On the MMT-47 benchmark spanning text, image, and video tasks, LiME matches or exceeds MoE-PEFT baselines while requiring up to 4x fewer trainable parameters.
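
To make the parameter claim concrete, the sketch below evaluates the two counts given in the Figure 1 caption: E × |ϕ| adapter parameters plus d_i × E router parameters for MoE-LoRA, against |ϕ| + E·d_o for LiME. It assumes LoRA as the PEFT method, so |ϕ| = r·(d_i + d_o); the layer sizes are hypothetical values chosen only for illustration, and the shared modulator is left out because the caption formula does not count it.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """|phi| for one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def moe_lora_trainable(num_experts: int, d_in: int, d_out: int, rank: int) -> int:
    """E * |phi| adapter parameters plus d_in * E learned-router parameters."""
    return num_experts * lora_params(d_in, d_out, rank) + d_in * num_experts


def lime_trainable(num_experts: int, d_in: int, d_out: int, rank: int) -> int:
    """|phi| + E * d_out: one shared adapter plus one modulator vector per expert."""
    return lora_params(d_in, d_out, rank) + num_experts * d_out


# Hypothetical layer sizes (not taken from the paper).
E, d_in, d_out, r = 4, 1024, 1024, 8
moe = moe_lora_trainable(E, d_in, d_out, r)
lime = lime_trainable(E, d_in, d_out, r)
print(moe, lime, round(moe / lime, 2))  # 69632 20480 3.4
```

For these particular sizes the ratio is 3.4; the actual ratio depends on the rank, the layer widths, and the number of experts.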

What carries the argument

Lightweight modulation of a shared PEFT module by expert vectors, paired with zero-parameter routing derived from existing representations.
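
A minimal sketch of the modulation step for a single token, following the output rule listed in the extracted algorithm fragments on this page (h ← z + ẑ ⊙ P + γ·(ẑ ⊙ p_s), with P the routing-weighted sum of expert vectors). The function name, the plain-list representation, and the example numbers are illustrative assumptions, not the authors' implementation.

```python
def lime_output(z, z_hat, weights, experts, p_shared, gamma):
    """Combine frozen and shared-PEFT outputs for one token.

    z        : frozen output W0 x, length d_out
    z_hat    : output of the single shared PEFT module, length d_out
    weights  : {expert index: renormalized routing weight} for selected experts
    experts  : list of expert modulator vectors p_i, each of length d_out
    p_shared : shared modulator vector p_s, length d_out
    gamma    : coefficient on the shared-modulator term
    """
    d_out = len(z)
    # P = sum over selected experts of w_i * p_i
    P = [sum(w * experts[i][k] for i, w in weights.items()) for k in range(d_out)]
    # h = z + z_hat * P + gamma * (z_hat * p_s), all elementwise
    return [z[k] + z_hat[k] * P[k] + gamma * z_hat[k] * p_shared[k] for k in range(d_out)]


# Toy values, for illustration only.
h = lime_output(
    z=[1.0, 2.0],
    z_hat=[0.5, -1.0],
    weights={0: 0.5, 1: 0.5},
    experts=[[1.0, 1.0], [2.0, 0.0]],
    p_shared=[1.0, 1.0],
    gamma=0.5,
)
print(h)  # [2.0, 1.0]
```

The point of the design is visible in the signature: the only per-expert state is one vector of length d_out, while the PEFT output `z_hat` is computed once and reused by every expert.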

Load-bearing premise

Modulating the output of one shared PEFT module with small expert vectors approximates the effect of separate full PEFT modules per expert within a bounded error.

What would settle it

A task distribution or PEFT method where the accuracy gap between LiME and standard per-expert MoE-PEFT exceeds the claimed bound or produces large degradation would falsify the approximation.

Figures

Figures reproduced from arXiv: 2604.02338 by Chen Chen, Haris Mansoor, Md Kowsher, Nusrat Jahan Prottasha, Ozlem Garibay, Victor Zhu, Zhengping Ji.

**Figure 1.** LiME is compatible with any PEFT method; we use LoRA only as an example. (a) MoE-LoRA replicates LoRA adapters (A_i, B_i) for each expert and uses a learned router, requiring E × |ϕ| adapter parameters plus d_i × E router parameters. (b) LiME shares a single PEFT module (LoRA here) and uses lightweight expert modulators p_i ∈ ℝ^{d_o}, reducing trainable MoE parameters to |ϕ| + E·d_o. Router reuse: routing is compu…

**Figure 2.** Efficiency comparison of LiME vs. MoE-PEFT baselines. (a) LiME variants (stars) achieve higher throughput and shorter training time; LiMEPromptTuning is the most efficient (4.52 samples/s, 25 min). (b) All methods show comparable peak memory due to the dominant frozen backbone. (c) LiME requires 0.02–0.57M trainable parameters, up to 4× fewer than corresponding MoE-PEFT methods. (d) Total model size remains…

**Figure 3.** Empirical validation of our theory. (a–b) Linear probe accuracy at different token positions within an n-gram window (layers 24 and 22), supporting Theorem 3. (c–d) GLUE accuracy versus number of experts for LiME and MoELoRA, supporting Theorem 1; stars mark the best E for each method. [Adjacent body text:] representations of LiME and MoELoRA using Centered Kernel Alignment (CKA), a standard measure of representation similarit…

**Figure 4.** Routing ablations. (a) Feature selection for routing is robust. (b) Zero-parameter routing matches learned routing performance. (c-d) Routing balance γ_r ∈ [0.6, 0.8] yields optimal performance by combining frozen and adapted signals.

**Figure 5.** (a) Auto Top-K outperforms fixed Top-K. (b-c) Moderate load balancing prevents collapse while preserving specialization; over-balancing hurts accuracy. (d) Optimal expert count is E ∈ [4, 6]; beyond this, insufficient data limits further gains. [Adjacent body text:] Load Balancing Coefficient (Figure 5b-c). We analyze the effect of balancing coefficient (applied to both L_imp and L_KL). Figure 5b shows accuracy versus coefficient…

**Figure 6.** Auto Top-K ablation. (a) Auto Top-K outperforms fixed Top-K=2 on most tasks. (b-d) PCA visualizations show inputs naturally require varying numbers of experts: one (b), two (c), or three (d). [Adjacent body text:] F.1. Auto Top-K Motivation and Analysis. We analyze the motivation for our Auto Top-K strategy over fixed Top-K selection. The core observation is that different inputs inherently require different numbers of experts: a…

**Figure 7.** Hyperparameter ablations. (a-b) N-gram window size shows robust performance across text and vision tasks. (c) Fixed Top-K selection peaks at k = 2-3; fewer limits expressiveness, more introduces noise. (d) Number of experts: optimal range E ∈ [3, 6] balances capacity and training efficiency. [Adjacent body text:] sharing routing decisions within windows, n-gram routing encourages locally consistent expert assignments that respe…

**Figure 8.** Scaling and temperature ablations. (a) LiME vs MoELoRA with increasing experts: LiME (green) maintains stable performance while MoELoRA (red) degrades significantly beyond E = 3–4, with drops of 8–10% on CoLA, MRPC, and STS-B due to overfitting. Green shading indicates where LiME outperforms. (b) MoE layer count: performance improves with more LiME layers, with optimal range at 20–24 layers (green region).…

**Figure 9.** Expert utilization heatmaps on SST-2 for varying balance coefficients (α = β). Without balancing (left), routing collapses to few experts per layer (low entropy H = 1.127). Increasing coefficients progressively balances utilization until near-uniform distribution (right, H = 1.385). Each cell shows the fraction of tokens routed to expert E_i at layer L_j. [Adjacent body text:] F.9. Expert Modulator Initialization (Figure 10a). We…

**Figure 10.** Ablation on initialization, target modules, and shared modulator. (a) Near-unity initialization (U(0.9, 1.1)) achieves best average accuracy (84.6%). (b) Applying LiME to all attention projections (q, k, v, o) yields optimal performance. (c) Shared modulator provides consistent gains across tasks. [Adjacent body text:] F.11. Shared Modulator (Figure 10c). We evaluate the contribution of the shared modulator p_s in the output formul…

**Figure 11.** t-SNE visualization of routing representations across all 24 layers on GLUE, colored by dominant expert assignment. Inputs routed to the same expert cluster together, confirming that zero-parameter routing learns meaningful specialization. Clustering sharpness varies by layer, with later layers showing more distinct expert separation.

Original abstract

MoE-PEFT methods combine Mixture of Experts with parameter-efficient fine-tuning for multi-task adaptation, but require separate adapters per expert, causing trainable parameters to scale linearly with expert count and limiting applicability to adapter-based architectures. We propose LiME (Lightweight Mixture of Experts), which achieves expert specialization through lightweight modulation rather than adapter replication. Instead of separate adapters, LiME uses a single shared PEFT module and modulates its output with lightweight expert vectors, reducing expert parameters while generalizing to any PEFT method. Notably, LiME introduces zero-parameter routing by leveraging existing frozen and adapted representations, eliminating the learned router parameters typically required per layer. Theoretically, we prove that (i) more experts preserve more task-relevant information and (ii) modulation approximates full expert-specific PEFT with bounded error. LiME further incorporates n-gram windowed routing and adaptive expert selection (Auto Top-K) based on routing confidence. Experiments on MMT-47, a multimodal multi-task benchmark with 47 tasks spanning text, image, and video, demonstrate that LiME achieves competitive or superior performance while using up to 4x fewer trainable parameters and up to 29% faster training compared to corresponding MoE-PEFT baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LiME gives a clean parameter cut in MoE-PEFT via shared-module modulation and zero-param routing, but the claimed bounded-error approximation is asserted without numerical checks on the actual tasks.

The core idea is to keep one shared PEFT adapter and modulate its output with small expert-specific vectors instead of duplicating full adapters per expert. They also drop the usual learned router by deriving routing scores directly from frozen and adapted representations, then add n-gram windowing and an Auto Top-K rule based on routing confidence. That package produces the reported parameter reduction of up to 4x and training speedup of up to 29% on the new MMT-47 benchmark while matching or beating standard MoE-PEFT baselines across text, image, and video tasks.
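
The routing described in the letter can be sketched for a single token as follows, based on the zero-parameter routing and Auto Top-K steps listed among the extracted algorithm fragments on this page. Several details are assumptions made for this sketch: the infinity norm is taken per token over the first E dimensions, a zero-norm guard is added, and the values of `gamma_r`, `tau`, and `theta` are placeholders (Figure 4 reports γ_r ∈ [0.6, 0.8] as the best range). N-gram windowing is not covered.

```python
import math


def zero_param_routing(z, z_hat, num_experts, gamma_r, tau):
    """Routing weights from the first E dims of frozen (z) and adapted (z_hat) outputs."""

    def inf_normalize(v):
        peak = max(abs(a) for a in v)
        # Guard against an all-zero slice (an addition of this sketch).
        return [a / peak for a in v] if peak > 0.0 else list(v)

    z_tilde = inf_normalize(z[:num_experts])
    zh_tilde = inf_normalize(z_hat[:num_experts])
    # Blend frozen and adapted signals, then apply a temperature softmax.
    logits = [((1.0 - gamma_r) * a + gamma_r * b) / tau
              for a, b in zip(z_tilde, zh_tilde)]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def auto_top_k(weights, theta):
    """Keep experts with w_i >= theta * max_j w_j and renormalize their weights."""
    cutoff = theta * max(weights)
    kept = {i: w for i, w in enumerate(weights) if w >= cutoff}
    norm = sum(kept.values())
    return {i: w / norm for i, w in kept.items()}


# Toy token with d_out = 4 and E = 3 experts (illustrative values only).
z = [2.0, -1.0, 0.5, 9.9]
z_hat = [1.0, 1.0, -1.0, 3.0]
w = zero_param_routing(z, z_hat, num_experts=3, gamma_r=0.7, tau=1.0)
selected = auto_top_k(w, theta=0.5)
print(w, selected)
```

No router weights appear anywhere in the computation, and the number of selected experts varies with how peaked `w` is rather than being fixed in advance.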

Referee Report

2 major / 3 minor

Summary. The paper proposes LiME, a lightweight Mixture of Experts method for multimodal multi-task learning. It replaces per-expert PEFT adapters with a single shared PEFT module whose outputs are modulated by lightweight expert-specific vectors, introduces zero-parameter routing that reuses frozen and adapted representations, and adds n-gram windowed routing plus Auto Top-K expert selection. Theoretical results claim that increasing the number of experts preserves more task-relevant information and that the modulation step approximates full expert-specific PEFT with bounded error. On the new MMT-47 benchmark (47 tasks across text, image, and video), LiME is reported to match or exceed MoE-PEFT baselines while using up to 4× fewer trainable parameters and up to 29% faster training.

Significance. If the bounded-error claim for shared-PEFT modulation holds with a tight, quantifiable bound across PEFT ranks and task distributions, the approach would meaningfully reduce the parameter scaling barrier in MoE-PEFT and generalize to arbitrary adapter families. The zero-parameter routing and the introduction of the MMT-47 benchmark are additional contributions that could influence efficient multi-task adaptation research.

major comments (2)

[§3.2] Modulation Approximation Theorem: the central claim that modulating a shared PEFT module approximates full expert-specific adapters with bounded error is load-bearing for the 4× parameter reduction result, yet the paper provides neither the explicit dependence of the bound on PEFT rank, modality, or task distribution nor any numerical evaluation of the approximation error on MMT-47 tasks.
[§4.2–4.3] Experimental Setup and Results: the reported gains (up to 4× fewer parameters, 29% faster training) rest on single-run point estimates without error bars, multiple random seeds, or statistical significance tests; this weakens the claim that LiME is competitive or superior across the 47 tasks.

minor comments (3)

[Abstract and §1] The phrase “theoretical proofs” appears without section or theorem numbers; add explicit cross-references to the statements in §3.
[§2.3] Routing: the n-gram window size and Auto Top-K threshold are introduced without an equation or pseudocode showing how routing scores are computed from the frozen representations.
[Table 2] Column headers for parameter counts should explicitly distinguish frozen vs. trainable parameters to avoid ambiguity when comparing to baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will make the indicated revisions to improve the clarity and robustness of the manuscript.

Referee: [§3.2] Modulation Approximation Theorem: the central claim that modulating a shared PEFT module approximates full expert-specific adapters with bounded error is load-bearing for the 4× parameter reduction result, yet the paper provides neither the explicit dependence of the bound on PEFT rank, modality, or task distribution nor any numerical evaluation of the approximation error on MMT-47 tasks.

Authors: The Modulation Approximation Theorem provides a general bound expressed in terms of the modulation vector norms and the operator norm of the shared PEFT module, which holds uniformly across modalities and task distributions. To make the dependence explicit, we will add a corollary in the revised §3.2 that isolates the scaling with PEFT rank r (error bounded by O(1/√r) under standard assumptions on the adapter weights). We will also add a new subsection in §4.3 with numerical approximation-error measurements (L2 output difference between modulated shared PEFT and per-expert PEFT) averaged over representative MMT-47 tasks for each modality and for ranks 8, 16, and 32. These additions will be included in the revision.
Revision: yes

Referee: [§4.2–4.3] Experimental Setup and Results: the reported gains (up to 4× fewer parameters, 29% faster training) rest on single-run point estimates without error bars, multiple random seeds, or statistical significance tests; this weakens the claim that LiME is competitive or superior across the 47 tasks.

Authors: We agree that single-run point estimates limit the strength of the empirical claims. In the revised manuscript we will rerun the primary MMT-47 experiments using at least three independent random seeds, report mean performance together with standard deviations, add error bars to all tables and figures, and include paired statistical significance tests (e.g., Wilcoxon signed-rank) against the strongest baselines. These changes will appear in the updated §4.2–4.3 and associated tables.
Revision: yes
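
As a rough illustration of the two measurements promised in the rebuttal, the sketch below computes a relative L2 gap between a modulated-shared-PEFT output and a per-expert-PEFT output, and a mean with sample standard deviation over repeated runs. The function names and the choice to normalize by the per-expert output norm are assumptions of this sketch; the rebuttal only specifies an "L2 output difference" and "mean performance together with standard deviations".

```python
import math
import statistics


def relative_l2_gap(h_modulated, h_per_expert):
    """||h_modulated - h_per_expert||_2 divided by ||h_per_expert||_2."""
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(h_modulated, h_per_expert)))
    ref = math.sqrt(sum(b * b for b in h_per_expert))
    # Fall back to the absolute gap if the reference output is all zeros.
    return diff / ref if ref > 0.0 else diff


def summarize(values):
    """Mean and sample standard deviation, e.g. across random seeds."""
    return statistics.mean(values), statistics.stdev(values)


# Placeholder vectors and scores, not results from the paper.
gap = relative_l2_gap([0.0, 0.0], [3.0, 4.0])
mean_score, std_score = summarize([1.0, 2.0, 3.0])
print(gap, mean_score, std_score)  # 1.0 2.0 1.0
```

With at least three seeds, as the authors propose, `statistics.stdev` is defined; it raises an error for a single run, which matches the referee's point that one run carries no variance estimate.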

Circularity Check

0 steps flagged

No significant circularity detected; claims rest on independent proof and experiments

The paper asserts a new theoretical proof that modulation of a shared PEFT module approximates expert-specific adapters with bounded error and that more experts preserve more task information; these are presented as derived results rather than redefinitions of inputs. Zero-parameter routing is described as leveraging existing frozen representations without learned routers, and performance gains are tied to direct experiments on the MMT-47 benchmark rather than any fitted parameter renamed as a prediction. No self-citations are invoked as load-bearing uniqueness theorems, no ansatz is smuggled via prior work, and no equations reduce by construction to the claimed outputs. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the unproven-in-abstract assumption that modulation approximates full per-expert adapters with bounded error and that existing representations suffice for routing without learned parameters. No explicit free parameters or invented entities are named in the abstract.

axioms (2)

domain assumption: Modulation of a shared PEFT output approximates expert-specific PEFT with bounded error. Invoked in the theoretical analysis section referenced by the abstract.
domain assumption: Existing frozen and adapted representations contain sufficient information for effective routing. Basis for the zero-parameter routing claim.

pith-pipeline@v0.9.0 · 5538 in / 1491 out tokens · 26197 ms · 2026-05-16T08:24:47.161446+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Theorem 2: LiME Approximates Expert-Specific PEFT ... modulation approximates full expert-specific PEFT with bounded error

IndisputableMonolith/Foundation/ArithmeticFromLogic · LogicNat_induction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Theorem 1: Adding Experts Is Information-Preserving ... I(Y;Z_n) ≥ I(Y;Z_{n-1})
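
The inequality in Theorem 1 has the shape of a data-processing argument. The sketch below chains the two conditions quoted in the extracted fragments on this page (factorization-on-support and identifiability) under one assumption that is not stated in the visible text, namely that the n-expert representation is $Z_n = A^{(n)}_{I_n} X$ and the (n-1)-expert representation is $Z_{n-1} = A^{(n-1)}_{r_{n-1}(X)} X$. It is a reading of the visible proof opening, not a reproduction of the paper's full proof.

```latex
\begin{align*}
Z_{n-1} &= A^{(n-1)}_{r_{n-1}(X)}\, X
         = R_{I_n} A^{(n)}_{I_n} X
         = R_{I_n} Z_n
         && \text{(factorization-on-support)} \\
        &= R_{\hat e(Z_n)} Z_n
         = h(Z_n) \quad \text{a.s.}
         && \text{(identifiability: } I_n = \hat e(Z_n) \text{ a.s.)} \\
I(Y; Z_{n-1}) &= I\big(Y; h(Z_n)\big) \le I(Y; Z_n)
         && \text{(data processing, } h \text{ measurable)}
\end{align*}
```

The last step uses only that $Z_{n-1}$ is almost surely a measurable function of $Z_n$, so $Y \to Z_n \to Z_{n-1}$ forms a Markov chain.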

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1] (Factorization-on-support) For each $e \in \{1, \dots, n\}$ there exists $R_e \in \mathbb{R}^{d \times d}$ such that for all $x \in \mathrm{Supp}(X \mid I_n = e)$, $A^{(n-1)}_{r_{n-1}(x)}\, x = R_e A^{(n)}_e x$.

[2] at least half as good as the best

(Identifiability) There exists a measurable $\hat e : \mathbb{R}^d \to \{1, \dots, n\}$ such that $I_n = \hat e(Z_n)$ a.s. Then $I(Y; Z_n) \ge I(Y; Z_{n-1})$. Proof. Define $h : \mathbb{R}^d \to \mathbb{R}^d$ by $h(z) := R_{\hat e(z)} z$. Since $\hat e$ is measurable and each $z \mapsto R_e z$ is continuous, $h$ is measurable. Let $N := \{\omega \in \Omega : I_n(\omega) \ne \hat e(Z_n(\omega))\}$. By identifiability, $P(N) = 0$. For each $e \in \{1, \dots, n\}$ let $S_e := \mathrm{Supp}(X \mid I_n = e)$. Then $P(X \in S_e \mid I_n =$ ...
[3] Any small slice of a transformer representation still carries global information. In transformers, each layer mixes information across dimensions through attention projections and feed-forward networks (Vaswani et al., 2017). As a result, each output dimension is influenced by many (often all) input dimensions. This means that even if we take only E dimens...

[4] Pretrained features are redundant, so reserving a small slice for routing is low-risk. Pretrained representations often contain substantial redundancy. For example, Luo et al. (2023) show that using only ∼1% of the most important feature dimensions can recover performance close to using the full representation. This supports our design in two ways. First, ...

[5] Sharing features for routing can encourage meaningful grouping. Using existing representations for routing can also act as a useful inductive bias. If routing depends on the same representations used for adaptation, the model is encouraged to organize its feature space so that inputs with similar semantics (and similar required adaptations) are routed to s...

[6] Performance typically peaks around E = 3–5 and remains within a narrow range thereafter. MoELoRA degrades significantly. In contrast, MoELoRA (red) exhibits sharp degradation beyond E = 3–4. The drops are particularly severe on CoLA (∼78% to ∼70%), STS-B (∼80% to ∼74%), and MRPC (∼83% to ∼73%): losses of 8–10 percentage points. This divergence stems from overf...
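[7] Base Forward: $z \leftarrow W_0 x$, with $z \in \mathbb{R}^{B \times T \times d_o}$ (frozen computation).

[8] PEFT Adaptation: $\hat z \leftarrow \hat z(x)$, with $\hat z \in \mathbb{R}^{B \times T \times d_o}$ (any PEFT method).

[9] Zero-Param Routing (no learned router): $\tilde z_{1:E} \leftarrow z_{[:,:,1:E]} / \lVert z_{[:,:,1:E]} \rVert_\infty$ (normalize the first $E$ dims); $\tilde{\hat z}_{1:E} \leftarrow \hat z_{[:,:,1:E]} / \lVert \hat z_{[:,:,1:E]} \rVert_\infty$; $w \leftarrow \mathrm{softmax}\big( ((1 - \gamma_r)\, \tilde z_{1:E} + \gamma_r\, \tilde{\hat z}_{1:E}) / \tau \big)$ (routing weights).

[10] Auto Top-K Selection: $S_\theta \leftarrow \{ i : w_i \ge \theta \cdot \max_j w_j \}$ (adaptive selection); $\tilde w_i \leftarrow w_i / \sum_{j \in S_\theta} w_j$ (renormalize).

[11] Expert Modulation: $P \leftarrow \sum_{i \in S_\theta} \tilde w_i \, p_i$ (weighted expert combination).

[12] Output: $h \leftarrow z + \hat z \odot P + \gamma\, (\hat z \odot p_s)$ (base + routed + shared).

[13] Load Balance (training only): $\bar p_i \leftarrow \frac{1}{BT} \sum_{b,t} w_i(x_{b,t})$ (mean routing probability per expert); $\mathcal{L}_{\mathrm{imp}} \leftarrow E \sum_{i=1}^{E} \bar p_i^{\,2} - 1$ (importance loss); $\mathcal{L}_{\mathrm{KL}} \leftarrow \sum_{i=1}^{E} \bar p_i \log(E\, \bar p_i)$ (KL-uniform loss); $\mathcal{L} \leftarrow \mathcal{L}_{\mathrm{task}} + \alpha\, \mathcal{L}_{\mathrm{imp}} + \beta\, \mathcal{L}_{\mathrm{KL}}$ (total loss). Return: $h$, $\mathcal{L}$ (loss only during training). LiME forward pass and training. Colors indicate: frozen, trainable, routing (0 params), training only...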
[14] prevents expert collapse. (Running page header: Submission and Formatting Instructions for ICML 2026.) I. Extended Related Work. Parameter-Efficient Fine-Tuning. PEFT methods adapt large pre-trained models by updating only a small subset of parameters. LoRA (Hu et al., 2022) introduces low-rank decomposition for weight updates, while adapters insert lightweight modules between ...

[15] decouples training for continual learning. PESC (Wu et al., 2024a) enables sparse model transitions in instruction tuning. Limitations of Existing MoE-PEFT. Existing methods share three limitations: (i) parameter explosion, with expert parameters scaling linearly with expert count (E × |ϕ|); (ii) router overhead from learned routers adding parameters and auxi...

[16] Multimodal Generalization: By training on text, video, and image data jointly, we test whether experts specialize by modality without explicit supervision.

[17] Multi-task Transfer: Within each modality, we include diverse tasks (e.g., sentiment vs. inference for text, VQA vs. classification for images) to test task-level routing.

[18] No Task Identifiers: Unlike prior multi-task learning setups, we do not provide explicit task or modality labels during training or inference. The model must learn to route based on input characteristics alone.

[19] Scale and Diversity: With 47 test sets across 5 categories, MMT-47 ensures comprehensive evaluation that goes beyond single-domain benchmarks. All datasets are formatted as text generation tasks with consistent prompt templates. For classification tasks, we format the output as the class label in text form. For regression tasks (STS-B), we discretize scor...
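
The load-balancing terms in the extracted algorithm fragment (importance loss E·Σ p̄_i² − 1 and KL-to-uniform loss Σ p̄_i log(E·p̄_i)) can be checked numerically with a short sketch. The function name and the list-of-lists input format are assumptions made here; terms with p̄_i = 0 are skipped in the KL sum, following the usual 0·log 0 = 0 convention.

```python
import math


def load_balance_losses(token_weights):
    """Return (p_bar, L_imp, L_KL) from per-token routing weights.

    token_weights: list over tokens; each item is a list of E routing weights.
    """
    num_tokens = len(token_weights)
    num_experts = len(token_weights[0])
    # Mean routing probability per expert over all tokens in the batch.
    p_bar = [sum(w[i] for w in token_weights) / num_tokens for i in range(num_experts)]
    # Importance loss: zero when p_bar is uniform, E - 1 when one expert takes all.
    l_imp = num_experts * sum(p * p for p in p_bar) - 1.0
    # KL(p_bar || uniform): zero when uniform, log(E) at full collapse.
    l_kl = sum(p * math.log(num_experts * p) for p in p_bar if p > 0.0)
    return p_bar, l_imp, l_kl


uniform = [[0.25, 0.25, 0.25, 0.25]] * 3
collapsed = [[1.0, 0.0, 0.0, 0.0]] * 3
print(load_balance_losses(uniform)[1:])    # (0.0, 0.0)
print(load_balance_losses(collapsed)[1:])  # (3.0, log 4)
```

Both terms vanish for perfectly uniform utilization and grow as routing collapses onto fewer experts, which is why the total loss adds them to the task loss with coefficients α and β.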
