pith. sign in

arxiv: 2606.29613 · v1 · pith:QSQINX7Cnew · submitted 2026-06-28 · 💻 cs.LG · cs.AI

Does Role Specialization Matter for Explanation Faithfulness in Mixture-of-Experts?

Pith reviewed 2026-06-30 07:09 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords mixture of expertsexplanation faithfulnessrepresentation decorrelationmultimodal modelsrole specializationinterpretabilityattribution methodsregularization
0
0 comments X

The pith

Representation decorrelation in mixture-of-experts models raises explanation faithfulness scores while preserving task performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether assigning semantic roles to experts in mixture-of-experts architectures suffices for faithful explanations. It argues that overlap between the latent representations learned by different experts undermines those roles and reduces the reliability of attribution methods. The authors add a regularization term that penalizes similarity across expert representations to enforce greater separation. Experiments on multimodal benchmarks show gains in comprehensiveness, sufficiency, and AOPC metrics for both role-specialized and standard MoE models, with no drop in accuracy. The work concludes that explicit role labels by themselves do not guarantee faithful attributions and that controlling representation overlap supplies a more effective lever.

Core claim

The authors claim that inter-expert representation overlap weakens effective role separation and degrades attribution-based faithfulness even when semantic roles are explicitly defined, and that representation-level decorrelation regularization, by minimizing this overlap, consistently improves faithfulness metrics across multimodal benchmarks while preserving task performance; they further claim that the same gains appear in ordinary sparse MoE baselines, indicating that structural role decomposition alone is insufficient.

What carries the argument

representation-level decorrelation regularization, which minimizes inter-expert similarity in latent space to encourage clearer specialization

If this is right

  • Faithfulness measured by comprehensiveness, sufficiency, and AOPC improves across multiple multimodal benchmarks.
  • Task performance stays the same after the regularization is added.
  • The improvement appears in both role-based MoE variants and standard sparse MoE models.
  • Structural assignment of semantic roles by itself does not guarantee faithful explanations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same decorrelation step could be tested in non-MoE modular networks to check whether reduced representation overlap improves faithfulness more generally.
  • If representation overlap is the dominant obstacle, then other independence-promoting losses might produce similar gains without needing explicit role labels.
  • The method might be combined with post-hoc explanation techniques to see whether the two approaches compound.

Load-bearing premise

The chosen faithfulness metrics based on input perturbations correctly measure whether an explanation matches the model's actual internal decision process.

What would settle it

An experiment on the same models where decorrelation is applied, faithfulness metric scores rise, yet human inspection of the resulting attributions shows they still omit or mis-rank the factors the model actually used.

Figures

Figures reproduced from arXiv: 2606.29613 by Housam Babiker, Mi-Young Kim, Randy Goebel, Yeji Kim.

Figure 1
Figure 1. Figure 1: Overview of the proposed framework, illustrated for a bimodal setting. Shared modality encoders produce feature-level representations, which are passed to role-based interaction experts. Each expert contains its own fusion model and produces expert￾specific logits and a latent representation. During training, the anchor latent represen￾tations are used to compute the representation-level decorrelation loss… view at source ↗
Figure 2
Figure 2. Figure 2: Validation sweeps on ENRICO. Left: mean pairwise linear CKA between expert representations as a function of λ. Right: validation AOPC comprehensiveness gap between identified and random masking. Moderate regularization reduces overlap and improves faithfulness, while excessively large λ can lead to unstable representations. Validation Sweeps and Selected Regularization Regimes [PITH_FULL_IMAGE:figures/ful… view at source ↗
Figure 3
Figure 3. Figure 3: Faithfulness perturbation curves using AttnRoll×Grad attribution. Top row shows comprehensiveness and bottom row shows sufficiency across masking ratios K ∈ {5, 10, 15, 20, 25}. Values denote mean scores across three runs. pend on a single modality, redundancy settings replicate the same signal across modalities, and synergy settings are generated using non-linear interactions such as XOR. Because the unde… view at source ↗
read the original abstract

Mixture-of-Experts (MoE) architectures have recently been extended with role-based mechanisms for interpretability. This is typically done by assigning semantic roles to individual expert components, for example roles like synergy, redundancy, and uniqueness in multimodal settings. However, whether such structural role decomposition preserves explanation faithfulness of the overall architecture remains largely underexplored. We hypothesize that inter-expert representation overlap weakens effective role separation and degrades attribution-based faithfulness, even when semantic roles are explicitly defined. To address this limitation, we introduce representation-level decorrelation regularization to explicitly reduce inter-expert similarity in latent space. Using representation decorrelation objectives, we encourage clearer specialization among experts by minimizing representation overlap. Our experiments show that across multiple multimodal benchmarks, this separation consistently improves explanation faithfulness, as measured by comprehensiveness, sufficiency, and their Area Over the Perturbation Curve (AOPC) summaries, while preserving task performance. We further show that these improvements are not limited to role-based architectures such as Interpretable Multimodal Interaction-aware MoE (I2MoE). Similar trends are observed in a standard sparse MoE baseline, suggesting that representation-level separation may provide a more general mechanism for enhancing explanation faithfulness in MoE systems. Overall, our findings suggest that structural role decomposition alone may be insufficient to guarantee faithful explanations and that representation-level separation helps improve explanation faithfulness. To support reproducibility, the source code and supplementary material are publicly available at https://github.com/dut0817/FL-I2MoE_Decor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that inter-expert representation overlap in Mixture-of-Experts (MoE) models weakens attribution-based explanation faithfulness even when semantic roles (e.g., synergy, redundancy, uniqueness) are explicitly assigned, as in architectures like I2MoE. It introduces representation-level decorrelation regularization to minimize latent-space overlap and encourage specialization. Experiments across multimodal benchmarks show consistent gains in faithfulness metrics (comprehensiveness, sufficiency, and AOPC) while preserving task performance; similar trends hold in a standard sparse MoE baseline, suggesting the approach is more general than role-based decomposition alone. Code is released for reproducibility.

Significance. If the results are robust, the work would indicate that structural role assignment is insufficient to guarantee faithful explanations in MoE systems and that explicit representation separation offers a practical, general mechanism for improving attribution quality. The public code release supports reproducibility. Significance is limited by the absence of direct evidence that metric gains reflect improved alignment with internal routing rather than secondary effects on perturbation sensitivity.

major comments (2)
  1. [Abstract / experimental results] Abstract and experimental results: the central claim that decorrelation improves faithfulness rests on gains in comprehensiveness, sufficiency, and AOPC, yet no direct validation is described (e.g., correlation between attribution ranks and actual expert activation vectors or gating decisions on the same inputs) that would rule out the alternative that gains arise from altered perturbation sensitivity rather than better reflection of the model's decision process.
  2. [Methods] Methods / decorrelation objective: the regularization is applied to reduce inter-expert similarity, but the manuscript does not analyze how this interacts with the sparse gating mechanism (e.g., whether it changes expert selection probabilities or routing entropy), leaving open whether the faithfulness metric improvements are load-bearing on the claimed mechanism or on downstream changes in output sensitivity.
minor comments (2)
  1. [Abstract] The abstract refers to 'FL-I2MoE_Decor' without defining the acronym on first use.
  2. [Experiments] No mention of the number of runs, statistical significance tests, or variance across seeds for the reported metric improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which help clarify the connection between representation decorrelation and explanation faithfulness. We address the major comments point by point below, proposing revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract / experimental results] Abstract and experimental results: the central claim that decorrelation improves faithfulness rests on gains in comprehensiveness, sufficiency, and AOPC, yet no direct validation is described (e.g., correlation between attribution ranks and actual expert activation vectors or gating decisions on the same inputs) that would rule out the alternative that gains arise from altered perturbation sensitivity rather than better reflection of the model's decision process.

    Authors: We agree that direct validation linking attributions to internal routing would provide stronger evidence. In the revised version, we will add an experiment computing the correlation between attribution ranks and expert activation vectors/gating decisions on the same inputs. This analysis will help confirm that the observed gains in faithfulness metrics reflect improved alignment with the model's decision process. revision: yes

  2. Referee: [Methods] Methods / decorrelation objective: the regularization is applied to reduce inter-expert similarity, but the manuscript does not analyze how this interacts with the sparse gating mechanism (e.g., whether it changes expert selection probabilities or routing entropy), leaving open whether the faithfulness metric improvements are load-bearing on the claimed mechanism or on downstream changes in output sensitivity.

    Authors: We acknowledge the need to examine the interaction with the gating mechanism. We will include additional analysis in the methods and results sections reporting expert selection probabilities and routing entropy with and without the decorrelation regularization. This will clarify the mechanism through which the regularization affects faithfulness. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivation chain

full rationale

The paper introduces representation decorrelation regularization as a method and reports experimental results on faithfulness metrics (comprehensiveness, sufficiency, AOPC) across multimodal benchmarks. No equations, predictions, or first-principles derivations are presented that could reduce to fitted inputs or self-citations by construction. The central claims rest on observed metric improvements that are independently measurable and falsifiable via the described experiments. Minor self-citation to I2MoE is not load-bearing for the reported gains.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Abstract-only view limits visibility; the central empirical claim rests on the unstated premise that the faithfulness metrics are valid proxies and that the regularization hyperparameter can be chosen without introducing new confounding effects.

free parameters (1)
  • decorrelation regularization strength
    Hyperparameter that controls the weight of the representation overlap penalty; its value must be selected or tuned.
axioms (1)
  • domain assumption Faithfulness metrics (comprehensiveness, sufficiency, AOPC) accurately reflect explanation quality
    Used as the primary evaluation criteria for the claimed improvement.

pith-pipeline@v0.9.1-grok · 5810 in / 1139 out tokens · 33503 ms · 2026-06-30T07:09:52.860670+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 8 canonical work pages · 6 internal anchors

  1. [1]

    In: Proceed- ings of the 58th Annual Meeting of the Association for Computational Linguistics

    Abnar, S., Zuidema, W.: Quantifying attention flow in transformers. In: Proceed- ings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 4190–4197. Online (Jul 2020)

  2. [2]

    On the Robustness of Interpretability Methods

    Alvarez-Melis, D., Jaakkola, T.S.: On the robustness of interpretability methods. arXiv preprint arXiv:1806.08049 (2018)

  3. [3]

    Gated Multimodal Units for Information Fusion

    Arevalo, J., Solorio, T., Montes-y Gómez, M., González, F.A.: Gated multimodal units for information fusion. arXiv preprint arXiv:1702.01992 (2017) 16 Y. Kim et al

  4. [4]

    IEEE Transactions on Knowledge and Data Engineering (2025)

    Cai, W., Jiang, J., Wang, F., Tang, J., Kim, S., Huang, J.: A survey on mixture of experts in large language models. IEEE Transactions on Knowledge and Data Engineering (2025)

  5. [5]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Chefer, H., Gur, S., Wolf, L.: Generic attention-model explainability for interpret- ing bi-modal and encoder-decoder transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 397–406 (2021)

  6. [6]

    Advances in neural information processing systems 35, 23049–23062 (2022)

    Chen, Z., Deng, Y., Wu, Y., Gu, Q., Li, Y.: Towards understanding the mixture-of- experts layer in deep learning. Advances in neural information processing systems 35, 23049–23062 (2022)

  7. [7]

    Advances in Neural Information Processing Systems35, 34600–34613 (2022)

    Chi, Z., Dong, L., Huang, S., Dai, D., Ma, S., Patra, B., Singhal, S., Bajaj, P., Song, X., Mao, X.L., et al.: On the representation collapse of sparse mixture of experts. Advances in Neural Information Processing Systems35, 34600–34613 (2022)

  8. [8]

    In: Proceedings of the 30th annual ACM symposium on user interface software and technology

    Deka, B., Huang, Z., Franzen, C., Hibschman, J., Afergan, D., Li, Y., Nichols, J., Kumar, R.: Rico: A mobile app dataset for building data-driven design ap- plications. In: Proceedings of the 30th annual ACM symposium on user interface software and technology. pp. 845–854 (2017)

  9. [9]

    In: Proceedings of the 58th annual meeting of the association for computational linguistics

    DeYoung, J., Jain, S., Rajani, N.F., Lehman, E., Xiong, C., Socher, R., Wallace, B.C.: Eraser: A benchmark to evaluate rationalized nlp models. In: Proceedings of the 58th annual meeting of the association for computational linguistics. pp. 4443–4458 (2020)

  10. [10]

    Do, G., Le, H., Tran, T.: Simsmoe: Toward efficient training mixture of experts via solvingrepresentationalcollapse.In:FindingsoftheAssociationforComputational Linguistics: NAACL 2025. pp. 2012–2025 (2025)

  11. [11]

    Journal of Machine Learning Research 23(120), 1–39 (2022)

    Fedus,W.,Zoph,B.,Shazeer,N.:Switchtransformers:Scalingtotrillionparameter models with simple and efficient sparsity. Journal of Machine Learning Research 23(120), 1–39 (2022)

  12. [12]

    arXiv preprint arXiv:2505.17553 (2025)

    Feng, J., Wei, C., Qiu, T., Hu, T., Pu, Z.: Comoe: Contrastive represen- tation for mixture-of-experts in parameter-efficient fine-tuning. arXiv preprint arXiv:2505.17553 (2025)

  13. [13]

    Neural computation3(1), 79–87 (1991)

    Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E.: Adaptive mixtures of local experts. Neural computation3(1), 79–87 (1991)

  14. [14]

    Scientific data10(1), 1 (2023)

    Johnson, A.E., Bulgarelli, L., Shen, L., Gayles, A., Shammout, A., Horng, S., Pollard, T.J., Hao, S., Moody, B., Gow, B., et al.: Mimic-iv, a freely accessible electronic health record dataset. Scientific data10(1), 1 (2023)

  15. [15]

    Feature-level Interaction Explanations in Multimodal Transformers

    Kim, Y., Babiker, H.K.B., Kim, M.Y., Goebel, R.: Feature-level interaction expla- nations in multimodal transformers. arXiv preprint arXiv:2603.13326 (2026)

  16. [16]

    In: International conference on machine learning

    Kornblith, S., Norouzi, M., Lee, H., Hinton, G.: Similarity of neural network rep- resentations revisited. In: International conference on machine learning. pp. 3519–

  17. [17]

    GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., Chen, Z.: Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668 (2020)

  18. [18]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

    Li, Y., Jiang, S., Hu, B., Wang, L., Zhong, W., Luo, W., Ma, L., Zhang, M.: Uni- moe: Scaling unified multimodal llms with mixture of experts. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  19. [19]

    Advances in neural information processing systems2021(DB1), 1 (2021) Does Role Specialization Matter for Explanation Faithfulness in MoE? 17

    Liang, P.P., Lyu, Y., Fan, X., Wu, Z., Cheng, Y., Wu, J., Chen, L., Wu, P., Lee, M.A., Zhu, Y., et al.: Multibench: Multiscale benchmarks for multimodal represen- tation learning. Advances in neural information processing systems2021(DB1), 1 (2021) Does Role Specialization Matter for Explanation Faithfulness in MoE? 17

  20. [20]

    IEEE Transactions on Multimedia (2026)

    Lin, B., Tang, Z., Ye, Y., Huang, J., Zhang, J., Pang, Y., Jin, P., Ning, M., Luo, J., Yuan, L.: Moe-llava: Mixture of experts for large vision-language models. IEEE Transactions on Multimedia (2026)

  21. [21]

    Lo, K.M., Huang, Z., Qiu, Z., Wang, Z., Fu, J.: A closer look into mixture-of- expertsinlargelanguagemodels.In:FindingsoftheAssociationforComputational Linguistics: NAACL 2025. pp. 4427–4447 (2025)

  22. [22]

    Computational Linguistics50(2), 657–723 (2024)

    Lyu, Q., Apidianaki, M., Callison-Burch, C.: Towards faithful model explanation in nlp: A survey. Computational Linguistics50(2), 657–723 (2024)

  23. [23]

    Regularizing CNNs with Locally Constrained Decorrelations

    Rodríguez, P., Gonzalez, J., Cucurull, G., Gonfaus, J.M., Roca, X.: Regulariz- ing cnns with locally constrained decorrelations. arXiv preprint arXiv:1611.01967 (2016)

  24. [24]

    IEEE transactions on neural networks and learning systems28(11), 2660–2673 (2016)

    Samek, W., Binder, A., Montavon, G., Lapuschkin, S., Müller, K.R.: Evaluating the visualization of what a deep neural network has learned. IEEE transactions on neural networks and learning systems28(11), 2660–2673 (2016)

  25. [25]

    In: 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA)

    Shahapure, K.R., Nicholas, C.: Cluster quality analysis using silhouette score. In: 2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA). pp. 747–748 (2020). https://doi.org/10.1109/DSAA49011.2020.00096

  26. [26]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., Dean, J.: Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017)

  27. [27]

    In: International conference on machine learning

    Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In: International conference on machine learning. pp. 3319–3328. PMLR (2017)

  28. [28]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Wenderoth, L., Hemker, K., Simidjievski, N., Jamnik, M.: Measuring cross-modal interactions in multimodal models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 21501–21509 (2025)

  29. [29]

    Wollstadt, P., Schmitt, S., Wibral, M.: A rigorous information-theoretic definition of redundancy and relevancy in feature selection based on (partial) information decomposition. J. Mach. Learn. Res.24(1) (Jan 2023)

  30. [30]

    In: Proceedings of the 42nd International Conference on Machine Learning

    Xin, J., Yun, S., Peng, J., Choi, I., Ballard, J.L., Chen, T., Long, Q.:I2MoE: In- terpretable multimodal interaction-aware mixture-of-experts. In: Proceedings of the 42nd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 267, pp. 68870–68888. PMLR (13–19 Jul 2025)

  31. [31]

    In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

    Yu, H., Qi, Z., Jang, L.K., Salakhutdinov, R., Morency, L.P., Liang, P.P.: MMoE: Enhancing multimodal models with mixtures of multimodal interaction experts. In: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 10006–10030. Miami, Florida, USA (Nov 2024)

  32. [32]

    In: International conference on machine learn- ing

    Zbontar, J., Jing, L., Misra, I., LeCun, Y., Deny, S.: Barlow twins: Self-supervised learning via redundancy reduction. In: International conference on machine learn- ing. pp. 12310–12320. PMLR (2021)