CoX-MoE: Coalesced Expert Execution for High-Throughput MoE Inference with AMX-Enabled CPU-GPU Co-Execution

Joo-Young Kim; Muyoung Son; Seungjae Yoo; Soongyu Choi; Yi Chen

arxiv: 2605.17889 · v2 · pith:RUBQMX3Inew · submitted 2026-05-18 · 💻 cs.LG

Muyoung Son, Yi Chen, Seungjae Yoo, Soongyu Choi, Joo-Young Kim

Pith reviewed 2026-05-20 12:04 UTC · model grok-4.3

classification 💻 cs.LG

keywords mixture of experts · inference optimization · cpu gpu co-execution · expert offloading · amx · throughput · memory pressure · moe

The pith

CoX-MoE uses ordinary batch sizes and CPU-GPU co-execution to increase MoE inference throughput by avoiding memory-bound expert execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large Mixture-of-Experts models create memory pressure during inference because their parameters exceed GPU capacity. Prior offloading methods use micro-batches that lower operational intensity and make execution memory-bound, or rely on slow CPU transfers that limit attention computation in the decode phase. CoX-MoE introduces coalesced expert execution on an AMX-enabled CPU-GPU system by using a coalescing-aware orchestration policy with ordinary batches and selective attention offloading, plus a static expert-aware stratification to keep frequent experts on the GPU. This combination allows better resource utilization and higher end-to-end throughput. A sympathetic reader would care because it enables running bigger models faster on existing mixed CPU-GPU hardware without additional accelerators.

Core claim

The central discovery is that coalesced expert execution, achieved through ordinary batch sizes for expert computation and selective attention offloading in a CPU-GPU collaborative setup with AMX, combined with pre-assigning frequent experts to the GPU, mitigates the inefficiencies of micro-batching and PCIe transfers to deliver up to 7.1x higher throughput than FlexGen and 2.4x higher than MoE-Lightning.

What carries the argument

The coalescing-aware orchestration policy and static expert-aware stratification scheme that jointly optimize resource allocation and workload balancing between CPU and GPU for expert and attention computation.

If this is right

MoE inference can achieve higher throughput on systems with both CPU and GPU by using larger batch sizes for experts.
PCIe transfer overhead is reduced by keeping frequently used experts on the GPU.
System utilization improves because the CPU takes on part of the computation while the GPU handles the rest.
End-to-end MoE decoding speed increases without requiring micro-batching that fragments workloads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar co-execution strategies could apply to other memory-intensive AI workloads like large language models with sparse activation.
This approach might reduce the need for multiple high-end GPUs in inference clusters by leveraging available CPU resources.
Further work could test dynamic expert assignment instead of static pre-assignment for varying input distributions.

Load-bearing premise

The assumption that adopting ordinary batch sizes instead of micro-batches for expert computation will avoid memory-bound behavior and that selective attention offloading remains practical in the decode stage without major performance or correctness penalties.

What would settle it

Running the system with ordinary batch sizes and measuring if expert execution becomes compute-bound with higher operational intensity, or observing if selective attention offloading in decode causes noticeable latency increases or output errors compared to full GPU execution.
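
One way to make this test concrete is a back-of-the-envelope roofline check. The sketch below is illustrative only and is not taken from the paper: the function names, the layer dimensions (4096 x 14336, FP16), and the device figures (100 TFLOP/s peak, 1 TB/s memory bandwidth) are all assumptions chosen for the example. It estimates the operational intensity of a single expert weight matrix as the number of tokens sharing it grows.

```python
def expert_gemm_intensity(tokens, d_model, d_ff, bytes_per_elem=2):
    """FLOPs per byte moved for one expert weight matrix applied to `tokens` rows.

    Assumes the weight matrix is read once and activations are read/written once.
    """
    flops = 2 * tokens * d_model * d_ff                  # multiply-accumulate count
    weight_bytes = d_model * d_ff * bytes_per_elem       # fixed cost, independent of batch
    act_bytes = tokens * (d_model + d_ff) * bytes_per_elem
    return flops / (weight_bytes + act_bytes)


def is_compute_bound(intensity, peak_flops, mem_bw):
    """Roofline test: compute-bound once intensity reaches the ridge point."""
    return intensity >= peak_flops / mem_bw


# Hypothetical device, not a measured system: ridge point = 100 FLOP/byte.
PEAK_FLOPS, MEM_BW = 100e12, 1e12

for tokens in (4, 32, 256, 2048):
    oi = expert_gemm_intensity(tokens, d_model=4096, d_ff=14336)
    print(tokens, round(oi, 1), is_compute_bound(oi, PEAK_FLOPS, MEM_BW))
```

Under these assumed numbers the 4-token case sits far below the ridge point and the 256-token case lands above it, which is the qualitative shift the premise depends on. Whether a real deployment shows the same shift would have to be measured on the actual hardware and kernels.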

Figures

Figures reproduced from arXiv: 2605.17889 by Joo-Young Kim, Muyoung Son, Seungjae Yoo, Soongyu Choi, Yi Chen.

**Figure 2.** Analysis of Micro-Batching Strategy for Inference.

**Figure 5.** Example of timing diagram for a single layer of …

**Figure 6.** Expert-Aware Stratification Workflow.

$T_{\text{comp}}$ follows a roofline model based on the hardware configuration of each device. The latency is determined by the bottleneck resource, which is either the memory bandwidth ($BW$) or the computational performance ($TF$).

$$T_{\text{comp}}(OP_i) = M \cdot \max\left(\frac{D_{X_i} + D_{Y_i}}{BW_{\text{DEV}}},\ \frac{C_i}{TF_{\text{DEV}}}\right) \tag{5}$$

Finally, $T_{\text{store}}$ is

$$T_{\text{store}}(OP_i) = \begin{cases} M D_{\text{KV}} / BW_{\text{PCIe}}, & \text{if } i = 1,\ x_0 = 1,\ x_1 = 0, \\ 0, & \text{otherwise.} \end{cases} \tag{6}$$

w…

**Figure 7.** Inference throughput (Tokens/s) comparison between CoX-MoE, MoE-Lightning and FlexGen across the system …

**Figure 8.** (a), (b) The relationship between expert hit ratio and …
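
Equations (5) and (6) quoted in the Figure 6 caption can be read as a small latency model. The following is a minimal sketch under one reading of the extracted text, namely that M scales the whole max term in Eq. (5). The excerpt truncates before the symbols are defined, so the parameter names below simply mirror the symbols, and the function names and numeric inputs are made up for illustration.

```python
def t_comp(m, d_x, d_y, c, bw_dev, tf_dev):
    """Eq. (5) as read here: roofline latency of one operator on one device.

    The slower of memory time ((D_X + D_Y) / BW_DEV) and compute time
    (C / TF_DEV) sets the latency, scaled by M.
    """
    memory_time = (d_x + d_y) / bw_dev
    compute_time = c / tf_dev
    return m * max(memory_time, compute_time)


def t_store(m, d_kv, bw_pcie, i, x0, x1):
    """Eq. (6) as read here: a PCIe store cost that applies only in one case."""
    if i == 1 and x0 == 1 and x1 == 0:
        return m * d_kv / bw_pcie
    return 0.0


# Made-up inputs: memory time 2.0 dominates compute time 1.0, so M=2 gives 4.0.
print(t_comp(m=2, d_x=10, d_y=10, c=100, bw_dev=10, tf_dev=100))
# The store term is nonzero only for i = 1, x0 = 1, x1 = 0.
print(t_store(m=3, d_kv=12, bw_pcie=4, i=1, x0=1, x1=0))
print(t_store(m=3, d_kv=12, bw_pcie=4, i=2, x0=1, x1=0))
```

Keeping the two terms as separate functions mirrors the caption, which introduces the store cost after the compute cost; how they combine into a total is not recoverable from the excerpt.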

Original abstract

The Mixture-of-Experts (MoE) architecture improves computational efficiency via sparse expert activation, but throughput-oriented inference faces substantial GPU memory pressure due to a significant parameter size and intermediate data. Prior works attempt to mitigate this using expert offloading with micro-batching or by offloading computation to the CPU. However, the fragmented workload resulting from micro-batching degrades operational intensity, causing expert execution to become memory-bound. Meanwhile, CPU offloading is constrained by slow PCIe transfers and its limited applicability to attention computation in the decode stage. Consequently, these inefficiencies prevent effective system utilization, severely restricting the end-to-end throughput of MoE inference. To address these challenges, this paper proposes CoX-MoE, an Advanced Matrix Extensions (AMX)-enabled CPU-GPU collaborative system that comprehensively optimizes MoE inference by combining coalesced expert execution with strategic workload orchestration for higher throughput. CoX-MoE introduces (i) a coalescing-aware orchestration policy to jointly optimize resource allocation by adopting ordinary batch, instead of micro-batch, for expert computation and selective attention offloading, and (ii) a static expert-aware stratification scheme that pre-assigns frequently activated experts to the GPU, mitigating PCIe transfer overhead and balancing workload for the CPU and GPU during inference. Compared to state-of-the-art frameworks, CoX-MoE delivers significant gains, achieving up to 7.1x and 2.4x higher throughput than FlexGen and MoE-Lightning, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CoX-MoE gives a concrete but lightly validated recipe for CPU-GPU MoE inference that trades micro-batching for ordinary batches and static expert placement.

The main point is a hybrid CPU-GPU system that coalesces expert work into ordinary-sized batches and pre-assigns frequent experts to the GPU while selectively moving some attention to the CPU with AMX instructions. This aims to raise operational intensity and cut PCIe traffic compared with earlier offloading schemes.

The paper does a clear job naming the problems with micro-batching (fragmented workloads that stay memory-bound) and with broad CPU offloading (slow transfers plus trouble with attention in decode). The reported 7.1x and 2.4x throughput numbers over FlexGen and MoE-Lightning are the kind of result that matters for people running large MoE models on existing servers. The static stratification and coalescing-aware policy are the incremental but practical additions that are not just re-packaged prior ideas.

The central assumptions still need checking. Decode workloads usually run at effective batch size one with heavy KV-cache traffic, so switching to ordinary batches for experts may require cross-sequence batching or other mechanisms whose overhead is not obvious from the description. Selective attention offloading over PCIe during autoregressive generation can introduce latency spikes or synchronization costs that the abstract does not quantify. The paper would benefit from more detail on exact hardware, workload traces, and whether the gains survive when those two choices are stress-tested on the same setup.

This is useful reading for systems engineers who deploy sparse models on hybrid CPU-GPU nodes and want throughput without buying more GPUs. It shows honest engagement with the practical bottlenecks, so it deserves a serious referee. I would send it to review with the expectation that referees will ask for tighter evidence on the decode-stage behavior and the experimental controls.

Referee Report

3 major / 1 minor

Summary. The paper proposes CoX-MoE, an AMX-enabled CPU-GPU co-execution system for high-throughput MoE inference. It addresses memory pressure from large expert parameters by combining coalesced expert execution (using ordinary batches rather than micro-batches), selective attention offloading, and a static expert-aware stratification scheme that pre-assigns frequently activated experts to the GPU to reduce PCIe overhead. The central claim is that these optimizations jointly improve resource allocation and workload balance, delivering up to 7.1x higher throughput than FlexGen and 2.4x higher than MoE-Lightning.

Significance. If the throughput gains are robustly demonstrated and the assumptions about batching and offloading hold, the work would represent a practical advance in hybrid CPU-GPU MoE serving by better exploiting AMX for expert computation and mitigating the memory-bound and transfer bottlenecks of prior offloading approaches.

major comments (3)

[Abstract] The throughput claims (7.1x over FlexGen, 2.4x over MoE-Lightning) are presented without any experimental details, workload descriptions, hardware specifications, or error analysis, so it is impossible to determine whether the gains are supported by the data or affected by unstated choices in batching or offloading.
[Coalescing-aware orchestration policy] The load-bearing assumption that ordinary (non-micro) batch sizes for expert computation will raise operational intensity enough to escape the memory-bound regime is not supported by roofline analysis or measurements; in autoregressive decode the effective batch size is typically 1 and KV-cache access dominates, so the claim that this choice avoids memory-bound behavior or PCIe penalties requires explicit verification.
[Selective attention offloading] The assertion that selective attention offloading to CPU remains low-overhead and correct during token-by-token decode lacks supporting evidence; PCIe transfers risk latency spikes and numerical drift without precise synchronization of activations, and the paper does not quantify these effects or demonstrate that they do not negate the reported gains.

minor comments (1)

The description of the static expert-aware stratification scheme would benefit from explicit pseudocode or a diagram showing how activation frequency thresholds are used to pre-assign experts.
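
As an illustration of the kind of pseudocode this comment asks for, the following is a hypothetical sketch and not the authors' algorithm. It assumes a profiled routing trace (a flat list of expert ids), a frequency threshold `min_freq`, and a GPU capacity `gpu_slots`; all three names and the ranking rule are assumptions introduced here.

```python
from collections import Counter


def stratify_experts(routing_trace, num_experts, min_freq, gpu_slots):
    """Split experts into a GPU-resident set and a CPU-resident set.

    An expert goes to the GPU if its share of the profiled trace is at least
    `min_freq`, most frequent first, until `gpu_slots` experts are placed.
    """
    trace = list(routing_trace)
    counts = Counter(trace)
    total = len(trace)
    # Rank by descending activation count; ties broken by expert id.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    hot = [e for e in ranked if total and counts[e] / total >= min_freq]
    gpu = set(hot[:gpu_slots])
    cpu = set(range(num_experts)) - gpu
    return gpu, cpu


def hit_ratio(routing_trace, gpu_experts):
    """Fraction of routed tokens whose expert is already GPU-resident."""
    trace = list(routing_trace)
    if not trace:
        return 0.0
    return sum(e in gpu_experts for e in trace) / len(trace)


# Toy trace over four experts; expert 0 is activated most often.
trace = [0, 0, 0, 1, 1, 2, 3, 0, 1, 0]
gpu, cpu = stratify_experts(trace, num_experts=4, min_freq=0.2, gpu_slots=2)
print(sorted(gpu), sorted(cpu), hit_ratio(trace, gpu))
```

Because the assignment is computed once from a trace, it is static in the sense the review describes; a shift in the input distribution would change the hit ratio without changing the placement.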

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating planned revisions to strengthen the presentation of our results and methodology.

Referee: [Abstract] The throughput claims (7.1x over FlexGen, 2.4x over MoE-Lightning) are presented without any experimental details, workload descriptions, hardware specifications, or error analysis, so it is impossible to determine whether the gains are supported by the data or affected by unstated choices in batching or offloading.

Authors: We acknowledge that the abstract presents the key throughput claims concisely without embedding full experimental details, as is conventional for abstracts due to length constraints. The full manuscript (Sections 4 and 5) provides the complete experimental setup, including model architectures and workloads, hardware specifications (AMX-enabled CPU and specific GPU), batching parameters, and results with multiple runs and variability measures. To address the referee's concern and improve standalone readability, we will revise the abstract to include a brief reference to the primary experimental conditions and hardware platform.
revision: yes

Referee: [Coalescing-aware orchestration policy] The load-bearing assumption that ordinary (non-micro) batch sizes for expert computation will raise operational intensity enough to escape the memory-bound regime is not supported by roofline analysis or measurements; in autoregressive decode the effective batch size is typically 1 and KV-cache access dominates, so the claim that this choice avoids memory-bound behavior or PCIe penalties requires explicit verification.

Authors: We appreciate the referee's emphasis on verifying the operational intensity benefits in the decode stage. While per-token processing in autoregressive generation starts with a batch size of 1, our coalesced execution aggregates expert computations across tokens from multiple concurrent requests and sequences, which measurably increases arithmetic intensity compared to micro-batching. The manuscript reports end-to-end throughput improvements under these conditions, but we agree that explicit roofline analysis would strengthen the claim. In the revised version, we will add roofline plots and operational intensity measurements for both micro-batching and ordinary batching during decode, explicitly addressing KV-cache effects and confirming the shift away from the memory-bound regime.
revision: yes

Referee: [Selective attention offloading] The assertion that selective attention offloading to CPU remains low-overhead and correct during token-by-token decode lacks supporting evidence; PCIe transfers risk latency spikes and numerical drift without precise synchronization of activations, and the paper does not quantify these effects or demonstrate that they do not negate the reported gains.

Authors: We thank the referee for highlighting the importance of quantifying overheads and correctness for selective attention offloading in the decode phase. Our orchestration policy incorporates synchronization barriers and selective transfer of only necessary activations to maintain numerical fidelity and control latency. The reported throughput gains already reflect the net effect after any transfer costs, but we concur that dedicated quantification is needed. We will revise the manuscript to include explicit measurements of PCIe transfer times, per-token latency distributions, and numerical accuracy comparisons (e.g., output equivalence to full-GPU baselines) to demonstrate that these factors do not offset the overall performance benefits.
revision: yes
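
The cross-request aggregation and the output-equivalence check mentioned in these responses can be illustrated with a toy example. This is a hedged sketch, not the system's implementation: it assumes top-1 routing, represents each expert as a single small weight matrix, uses pure-Python integer arithmetic, and every function name is hypothetical. It counts how many times expert weights are invoked, as a stand-in for weight traffic, and checks that coalescing does not change the outputs.

```python
from collections import defaultdict


def matmul(x, w):
    """Multiply row vectors x (n x d) by a weight matrix w (d x k)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def run_coalesced(tokens, routes, weights):
    """One matmul per expert over all tokens routed to that expert."""
    groups = defaultdict(list)
    for idx, expert in enumerate(routes):
        groups[expert].append(idx)
    out = [None] * len(tokens)
    calls = 0
    for expert, idxs in groups.items():
        ys = matmul([tokens[i] for i in idxs], weights[expert])
        calls += 1  # each call stands for one pass over the expert's weights
        for i, y in zip(idxs, ys):
            out[i] = y
    return out, calls


def run_micro_batched(tokens, routes, weights, micro):
    """Same computation, but the batch is split into chunks of size `micro`."""
    out, calls = [], 0
    for start in range(0, len(tokens), micro):
        o, c = run_coalesced(tokens[start:start + micro],
                             routes[start:start + micro], weights)
        out.extend(o)
        calls += c
    return out, calls


tokens = [[1, 2], [3, 4], [5, 6], [7, 8]]
routes = [0, 1, 0, 1]                        # assumed top-1 expert per token
weights = {0: [[1, 0], [0, 1]], 1: [[2, 0], [0, 2]]}

full, full_calls = run_coalesced(tokens, routes, weights)
micro, micro_calls = run_micro_batched(tokens, routes, weights, micro=2)
assert full == micro                         # identical outputs
print(full_calls, micro_calls)               # fewer expert invocations when coalesced
```

In this toy setting the coalesced path touches each expert once, while the micro-batched path touches each expert once per chunk. The equality check is exact only because the example uses integers; a floating-point comparison against a full-GPU baseline would need a tolerance.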

Circularity Check

0 steps flagged

No circularity: empirical system claims rest on external benchmarks

The paper describes a systems implementation (CoX-MoE) that combines coalesced expert execution, ordinary batch sizes for experts, selective attention offloading, and static expert stratification on AMX-enabled CPU-GPU hardware. Throughput gains are asserted via direct comparison to external baselines (FlexGen, MoE-Lightning) rather than any internal derivation, equation, fitted parameter, or prediction that reduces to the paper's own inputs. No self-definitional constructs, uniqueness theorems, ansatz smuggling, or renaming of known results appear; the load-bearing steps are engineering choices whose correctness is evaluated against independent measurements on the same hardware.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach depends on hardware assumptions and workload regularity that are not quantified in the abstract.

free parameters (1)

expert activation frequency threshold
Used to decide which experts are pre-assigned to GPU in the stratification scheme.

axioms (1)

domain assumption: AMX instructions are present and deliver meaningful acceleration for the matrix operations arising in MoE expert layers.
The entire CPU co-execution path relies on this hardware feature.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "coalescing-aware orchestration policy ... adopting ordinary batch, instead of micro-batch, for expert computation and selective attention offloading"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "static expert-aware stratification scheme that pre-assigns frequently activated experts to the GPU"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
