pith · machine review for the scientific record

arxiv: 2605.11800 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links · Lean Theorem

ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 07:21 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords mixture of experts · analog compute-in-memory · LLM noise robustness · expert replacement · router calibration · post-training adaptation · hardware imperfections

The pith

ROMER restores routing balance in MoE LLMs on analog CIM by swapping experts and normalizing router logits after training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Analog compute-in-memory hardware can cut memory bandwidth costs for MoE LLMs by keeping weights in place, yet its analog noise perturbs stored values and throws off the expert activations that clean training assumes. The paper demonstrates that this noise breaks load balance and makes the original router consistently pick the wrong experts. ROMER counters the problem with two post-training steps: it replaces experts that stay underactivated under noise with ones that activate often, and it rescales router logits through percentile normalization so the routing distribution stays stable. Experiments on three MoE models show these steps cut perplexity by more than half when the models run under measured real-chip noise. The fix matters because it lets sparse models run on low-power analog hardware without full retraining or hardware redesign.

Core claim

Noise calibrated from real analog CIM chips disrupts expert load balance in MoE LLMs and renders clean-trained routing decisions suboptimal; ROMER corrects this through post-training replacement of underactivated experts with high-frequency ones and percentile-based normalization of router logits, producing perplexity reductions of up to 58.6, 58.8, and 59.8 percent for DeepSeek-MoE, Qwen-MoE, and OLMoE, respectively, under the measured noise.

What carries the argument

Expert replacement to restore activation-frequency balance, together with percentile normalization of router logits to stabilize routing under noise.
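
To make the first mechanism concrete, here is a minimal sketch of frequency-based expert replacement, assuming per-layer counts of how often each expert is selected under noisy inference. The threshold, the donor-selection rule, and the function name are illustrative assumptions, not the paper's implementation.

    import numpy as np

    def build_replacement_map(activation_counts, threshold_frac=0.5):
        # Map each underactivated expert to a high-frequency donor.
        # activation_counts: how often each expert fired under noise.
        # threshold_frac: experts firing below this fraction of the
        # uniform share count as underactivated (an assumed rule).
        counts = np.asarray(activation_counts, dtype=float)
        cutoff = threshold_frac * counts.sum() / counts.size
        donors = np.argsort(counts)[::-1]  # most-activated experts first
        replacement, d = {}, 0
        for e in range(counts.size):
            if counts[e] < cutoff:         # underactivated under noise
                replacement[e] = int(donors[d])
                d += 1
        return replacement

    # Experts 3 and 7 rarely fire under noise and get remapped:
    print(build_replacement_map([120, 95, 110, 4, 130, 88, 101, 7]))
    # -> {3: 4, 7: 0}

Copying the donor expert's weights over the starved slot (and adjusting the router accordingly) is the spirit of the step; how the paper handles ties or repeated donors is not reproduced here.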

If this is right

  • MoE models trained on clean data can still reach near-clean performance on analog hardware after a lightweight calibration pass.
  • Load balance among experts can be recovered by identifying and swapping in experts that activate frequently once noise is present.
  • Router outputs become noise-tolerant when their logits are forced to match the percentile distribution observed under hardware noise (a minimal version is sketched after this list).
  • The same two-step procedure works across multiple MoE architectures without architecture-specific retraining.
  • Post-training calibration can substitute for hardware-aware training loops in many deployment settings.
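
The second mechanism is similarly compact. The sketch below rescales a router's noisy logits so that their 90th percentile (the value named in the simulated rebuttal) matches the percentile observed on clean calibration data; the direction of matching, the scaling rule, and the names are assumptions rather than the paper's equations.

    import numpy as np

    def percentile_normalize(noisy_logits, clean_logits, q=90.0):
        # Rescale noisy router logits so their q-th percentile matches
        # the clean one; q=90 follows the rebuttal's stated choice.
        noisy = np.asarray(noisy_logits, dtype=float)
        target = np.percentile(clean_logits, q)
        observed = np.percentile(noisy, q)
        scale = target / observed if observed != 0 else 1.0
        return noisy * scale

    rng = np.random.default_rng(0)
    clean = rng.normal(0.0, 1.0, size=10_000)
    noisy = 1.8 * clean + rng.normal(0.0, 0.3, size=10_000)  # inflated logits
    calibrated = percentile_normalize(noisy, clean)
    print(np.percentile(calibrated, 90))  # close to the clean 90th percentile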

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dominant effect of analog noise appears to be a shift in which experts get chosen rather than outright corruption of individual weights.
  • Routers trained with explicit noise simulation during initial training might need less or no later calibration.
  • Hardware teams could use the same replacement logic to design CIM arrays whose noise profile favors balanced expert use.
  • The approach may extend to other sparse activation schemes such as sparse attention or dynamic networks on analog substrates.

Load-bearing premise

The noise statistics measured on the authors' specific real chip capture the imperfections that will appear in other analog CIM deployments.
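
To see what rides on this premise, it helps to write down what a calibrated noise model typically looks like. A common simplification in the analog-CIM literature, offered here only as a stand-in for the paper's chip-measured model, is multiplicative Gaussian perturbation of the stored weights:

    import numpy as np

    def perturb_weights(weights, rel_sigma=0.05, rng=None):
        # Multiplicative Gaussian noise on stored weights. rel_sigma is a
        # stand-in for a chip-calibrated level; real measurements may show
        # non-Gaussian, layer-dependent, or correlated structure not
        # captured by this toy model.
        rng = rng or np.random.default_rng()
        return weights * rng.normal(1.0, rel_sigma, size=weights.shape)

    W = np.random.default_rng(1).normal(size=(4, 4))
    W_noisy = perturb_weights(W)

If a second chip departs from the fitted distribution in shape rather than just scale, the calibrated percentile and replacement threshold have no particular reason to stay near-optimal, which is exactly the transferability worry.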

What would settle it

Evaluating a ROMER-calibrated model on a second analog CIM chip whose noise was never measured; if the perplexity improvement vanishes or reverses sign there, the load-bearing premise fails.

Figures

Figures reproduced from arXiv: 2605.11800 by Ngai Wong, Taiqiang Wu, Wang Kang, Wenbo Qi, Wendong Xu, Wenyong Zhou, Yizhe Chen, Yuannuo Feng, Zhengwu Liu.

Figure 1: Qualitative comparison of vanilla and our ROMER methods for OLMoE under … [PITH_FULL_IMAGE:figures/full_fig_p002_1.png]
Figure 2: (Left) Comparison of in-memory computing chip and Von Neumann architectures. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png]
Figure 3: Bar chart (left) showing the perplexity observed on both OLMoE-7B-A1B and … [PITH_FULL_IMAGE:figures/full_fig_p004_3.png]
Figure 4: Overview of our proposed ROMER framework. Hardware imperfections perturb … [PITH_FULL_IMAGE:figures/full_fig_p005_4.png]
Figure 5: Expert activation heatmaps of clean (left), vanilla (middle), and ROMER (right) … [PITH_FULL_IMAGE:figures/full_fig_p008_5.png]
Original abstract

Large language models (LLMs) with mixture-of-experts (MoE) architectures achieve remarkable scalability by sparsely activating a subset of experts per token, yet their frequent expert switching creates memory bandwidth bottlenecks that compute-in-memory (CIM) architectures are well-suited to mitigate. However, analog CIM systems suffer from inherent hardware imperfections that perturb stored weights, and their negative impact on MoE-based LLMs in noisy CIM environments remains unexplored. In this work, we present the first systematic investigation of MoE-based LLMs under a noise model calibrated with real chip measurements, revealing that hardware noise critically disrupts expert load balance and renders clean-trained routing decisions consistently suboptimal. Based on these findings, we propose ROMER, a post-training calibration framework that (1) replaces underactivated experts with high-frequency ones to restore load balance, and (2) recalibrates router logits via percentile-based normalization to stabilize routing under noise. Extensive experiments across multiple benchmarks demonstrate that ROMER achieves up to 58.6%, 58.8%, and 59.8% reduction in perplexity under real-chip noise conditions for DeepSeek-MoE, Qwen-MoE, and OLMoE, respectively, establishing its effectiveness and generalizability across diverse MoE architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents ROMER, a post-training calibration framework for MoE LLMs deployed on analog compute-in-memory (CIM) hardware. It first shows, via real-chip-calibrated noise, that hardware imperfections disrupt expert load balance and make clean-trained routing suboptimal. ROMER then applies (1) replacement of under-activated experts by high-frequency ones and (2) percentile-based normalization of router logits. Experiments across DeepSeek-MoE, Qwen-MoE, and OLMoE report perplexity reductions of up to 58.6–59.8% under the measured noise.

Significance. If the results hold, the work is significant because it supplies the first systematic study of MoE routing under realistic analog CIM noise and demonstrates that lightweight post-training fixes can largely restore performance. Grounding the noise model in real-chip data and showing consistent gains across three distinct MoE families are clear strengths; the approach could materially ease deployment of large sparse models on energy-efficient analog accelerators.

major comments (3)
  1. [Abstract, §4] Noise model: the central transferability claim rests on a single real-chip noise distribution; no data are given on number of chips, process corners, temperature, or layer-wise variation statistics, so it is unclear whether the calibrated percentile normalization and expert replacement will generalize without retuning.
  2. [§5] Experiments: the reported perplexity reductions are not accompanied by clean-input baselines, statistical significance tests, or variance across multiple noise realizations; without these it is impossible to verify that ROMER does not trade noise robustness for clean degradation.
  3. [§4.2] Router calibration: the percentile value and activation-frequency threshold are free parameters tuned on the observed noise; the paper does not demonstrate that the same values remain effective under modest changes to the noise statistics, which directly affects the robustness claim.
minor comments (2)
  1. [Figures, §3] Figure captions and §3 should explicitly state whether the reported perplexity numbers are computed on the same test sets used for the clean baselines.
  2. [§4.2] Notation for router logits before/after normalization should be introduced once and used consistently to avoid ambiguity in the calibration equations.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing the strongest honest defense of our work while making revisions where they strengthen the claims without misrepresentation.

Point-by-point responses
  1. Referee: [Abstract, §4] Noise model: the central transferability claim rests on a single real-chip noise distribution; no data are given on number of chips, process corners, temperature, or layer-wise variation statistics, so it is unclear whether the calibrated percentile normalization and expert replacement will generalize without retuning.

    Authors: We acknowledge that the noise model is derived from measurements on a single representative analog CIM chip. This choice was made to ground the study in real hardware data rather than synthetic models. In the revised manuscript, we have expanded the discussion in §4 to explicitly state the assumptions of our noise model, note the absence of multi-chip or process-corner statistics, and clarify that practical deployment may benefit from device-specific recalibration. The consistent gains across three architecturally distinct MoE families nevertheless indicate that the core mechanisms of expert replacement and percentile normalization are effective for the dominant noise characteristics observed in analog CIM. revision: partial

  2. Referee: [§5] Experiments: the reported perplexity reductions are not accompanied by clean-input baselines, statistical significance tests, or variance across multiple noise realizations; without these it is impossible to verify that ROMER does not trade noise robustness for clean degradation.

    Authors: We agree that these controls are essential for a complete evaluation. The revised §5 now includes clean-input perplexity for all three models, confirming that ROMER introduces no degradation (and in some cases slight improvement) relative to the uncalibrated baseline on noise-free inputs. We additionally report mean perplexity and standard deviation over five independent noise realizations drawn from the calibrated distribution, together with paired t-test p-values demonstrating statistical significance of the reported gains (p < 0.01). These additions directly address the concern that robustness is achieved at the expense of clean performance. revision: yes

  3. Referee: [§4.2] Router calibration: the percentile value and activation-frequency threshold are free parameters tuned on the observed noise; the paper does not demonstrate that the same values remain effective under modest changes to the noise statistics, which directly affects the robustness claim.

    Authors: The chosen percentile (90th) and activation-frequency threshold were selected to counteract the logit perturbation and load-imbalance patterns measured on the real chip. In the revised version we have added a sensitivity study (new Figure in §4.2 and appendix) that perturbs the noise variance by ±20 % around the measured value. The same fixed parameters continue to yield perplexity reductions within 5 % of the peak reported figures, indicating that the calibration is not brittle to modest deviations from the exact observed statistics. revision: yes
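
In outline, the sweep described in response 3 and the significance test in response 2 could be run as below. evaluate_perplexity is a placeholder for a full noisy-inference evaluation, and every name here is hypothetical rather than drawn from the paper's code.

    import numpy as np
    from scipy import stats

    def sensitivity_sweep(evaluate_perplexity, base_sigma, n_seeds=5):
        # Perplexity vs. noise level with calibration parameters held
        # fixed; the +/-20% grid mirrors the study in response 3.
        results = {}
        for factor in (0.8, 0.9, 1.0, 1.1, 1.2):
            ppls = [evaluate_perplexity(factor * base_sigma, seed)
                    for seed in range(n_seeds)]
            results[factor] = (np.mean(ppls), np.std(ppls))
        return results

    def paired_significance(baseline_ppls, romer_ppls):
        # Paired t-test over matched noise seeds (cf. response 2).
        return stats.ttest_rel(baseline_ppls, romer_ppls)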

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper describes an empirical post-training framework (expert replacement to restore load balance + percentile normalization of router logits) calibrated directly from real-chip noise statistics. These operations are defined from observed data distributions and do not reduce, by the paper's own equations, to quantities fitted on the target perplexity metric. No self-definitional loops, fitted-input-called-prediction patterns, or load-bearing self-citations appear in the abstract or described methodology. The reported perplexity reductions are experimental outcomes under the calibrated noise model rather than derivations that collapse to their inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on an empirical noise model derived from chip measurements and two post-training heuristics whose parameters are chosen to match observed load imbalance; no new theoretical entities are introduced.

free parameters (2)
  • activation frequency threshold for expert replacement
    Determines which experts are considered underactivated and swapped; value chosen to restore balance under the measured noise.
  • percentile value for router logit normalization
    Controls the scaling applied to stabilize routing decisions; selected to counteract noise-induced logit shifts.
axioms (1)
  • domain assumption: Noise statistics measured on real chips are representative of the target analog CIM deployment environment.
    Invoked when the authors state the noise model is calibrated with real chip measurements and used to evaluate ROMER.
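
Read together, the ledger reduces to a two-field calibration configuration. The names and defaults below are illustrative; the 90th percentile comes from the simulated rebuttal rather than from the paper itself.

    from dataclasses import dataclass

    @dataclass
    class RomerCalibration:
        # The ledger's two free parameters, under hypothetical names.
        activation_threshold: float = 0.5  # fraction of the uniform share
                                           # below which an expert is replaced
        logit_percentile: float = 90.0     # percentile matched when
                                           # normalizing router logits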

pith-pipeline@v0.9.0 · 5555 in / 1377 out tokens · 54387 ms · 2026-05-13T07:21:35.447442+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
