CoX-MoE: Coalesced Expert Execution for High-Throughput MoE Inference with AMX-Enabled CPU-GPU Co-Execution
Pith reviewed 2026-05-20 12:04 UTC · model grok-4.3
The pith
CoX-MoE uses ordinary batch sizes and CPU-GPU co-execution to increase MoE inference throughput by avoiding memory-bound expert execution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that coalesced expert execution mitigates the inefficiencies of micro-batching and PCIe transfers. In an AMX-enabled CPU-GPU collaborative setup, it is achieved by using ordinary batch sizes for expert computation, offloading attention selectively, and pre-assigning frequently activated experts to the GPU. The reported result is up to 7.1x higher throughput than FlexGen and up to 2.4x higher than MoE-Lightning.
What carries the argument
The coalescing-aware orchestration policy and static expert-aware stratification scheme that jointly optimize resource allocation and workload balancing between CPU and GPU for expert and attention computation.
If this is right
- MoE inference can achieve higher throughput on systems with both CPU and GPU by using larger batch sizes for experts.
- PCIe transfer overhead is reduced by keeping frequently used experts on the GPU.
- System utilization improves because the CPU takes on part of the expert and attention computation while the GPU handles the rest.
- End-to-end MoE decoding speed increases without requiring micro-batching that fragments workloads.
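To make the fragmentation point concrete, the sketch below counts how many token rows each expert GEMM receives when a decode batch is processed whole versus split into micro-batches. The routing table, the batch of 8 requests, and the function names are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from collections import Counter

def tokens_per_expert(routing):
    """Count how many tokens are routed to each expert.

    `routing` has one entry per token; each entry lists the expert ids
    selected for that token (top-k routing).
    """
    counts = Counter()
    for experts in routing:
        counts.update(experts)
    return counts

def mean_gemm_rows(routing, micro_batch_size):
    """Average number of rows per expert GEMM for a given micro-batch size.

    Each micro-batch launches one GEMM per expert it activates, so the row
    count of a GEMM is the number of tokens in that micro-batch routed there.
    """
    rows = []
    for start in range(0, len(routing), micro_batch_size):
        chunk = routing[start:start + micro_batch_size]
        rows.extend(tokens_per_expert(chunk).values())
    return sum(rows) / len(rows)

# Hypothetical decode step: 8 concurrent requests, one token each,
# top-2 routing over 4 experts.
routing = [[0, 1], [0, 2], [1, 3], [0, 1], [2, 3], [0, 3], [1, 2], [0, 1]]

coalesced = mean_gemm_rows(routing, micro_batch_size=len(routing))
fragmented = mean_gemm_rows(routing, micro_batch_size=2)
print(f"rows per expert GEMM: coalesced={coalesced:.2f}, micro-batched={fragmented:.2f}")
```

In this toy case the whole batch gives each expert GEMM 4 rows on average, while micro-batches of 2 give about 1.33, with the same weights read for fewer rows each time.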
Where Pith is reading between the lines
- Similar co-execution strategies could apply to other memory-intensive AI workloads like large language models with sparse activation.
- This approach might reduce the need for multiple high-end GPUs in inference clusters by leveraging available CPU resources.
- Further work could test dynamic expert assignment instead of static pre-assignment for varying input distributions.
Load-bearing premise
The assumption that adopting ordinary batch sizes instead of micro-batches for expert computation will avoid memory-bound behavior and that selective attention offloading remains practical in the decode stage without major performance or correctness penalties.
What would settle it
Running the system with ordinary batch sizes and measuring whether expert execution becomes compute-bound at the resulting higher operational intensity, or observing whether selective attention offloading in decode causes noticeable latency increases or output errors compared with full GPU execution.
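A first-order version of that measurement can be estimated on paper with a roofline calculation. The sketch below is a simplified model under stated assumptions: each matmul reads weights and inputs once and writes outputs once, elements are 2 bytes wide, and the device peak (100 TFLOP/s) and bandwidth (1 TB/s) are placeholder numbers rather than any hardware used in the paper. The layer dimensions are illustrative.

```python
def gemm_operational_intensity(tokens, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for a (tokens x d_in) @ (d_in x d_out) matmul.

    Counts one read of the weights and input activations and one write of
    the output; cache reuse and kernel fusion are ignored.
    """
    flops = 2 * tokens * d_in * d_out
    bytes_moved = bytes_per_elem * (d_in * d_out + tokens * d_in + tokens * d_out)
    return flops / bytes_moved

def is_compute_bound(intensity, peak_flops, mem_bandwidth):
    """Roofline test: compute-bound when intensity reaches the ridge point."""
    return intensity >= peak_flops / mem_bandwidth

# Placeholder device: 100 TFLOP/s peak, 1 TB/s memory bandwidth (ridge = 100 FLOP/B).
PEAK_FLOPS = 100e12
MEM_BANDWIDTH = 1e12

for tokens in (1, 4, 256):
    oi = gemm_operational_intensity(tokens, d_in=4096, d_out=14336)
    regime = "compute-bound" if is_compute_bound(oi, PEAK_FLOPS, MEM_BANDWIDTH) else "memory-bound"
    print(f"tokens={tokens:4d}  intensity={oi:7.2f} FLOP/B  {regime}")
```

Under this model the intensity is close to the number of token rows while that number is small, which is why the number of tokens reaching each expert GEMM decides which side of the ridge point it lands on.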
Original abstract
The Mixture-of-Experts (MoE) architecture improves computational efficiency via sparse expert activation, but throughput-oriented inference faces substantial GPU memory pressure due to a significant parameter size and intermediate data. Prior works attempt to mitigate this using expert offloading with micro-batching or by offloading computation to the CPU. However, the fragmented workload resulting from micro-batching degrades operational intensity, causing expert execution to become memory-bound. Meanwhile, CPU offloading is constrained by slow PCIe transfers and its limited applicability to attention computation in the decode stage. Consequently, these inefficiencies prevent effective system utilization, severely restricting the end-to-end throughput of MoE inference. To address these challenges, this paper proposes CoX-MoE, an Advanced Matrix Extensions (AMX)-enabled CPU-GPU collaborative system that comprehensively optimizes MoE inference by combining coalesced expert execution with strategic workload orchestration for higher throughput. CoX-MoE introduces (i) a coalescing-aware orchestration policy to jointly optimize resource allocation by adopting ordinary batch, instead of micro-batch, for expert computation and selective attention offloading, and (ii) a static expert-aware stratification scheme that pre-assigns frequently activated experts to the GPU, mitigating PCIe transfer overhead and balancing workload for the CPU and GPU during inference. Compared to state-of-the-art frameworks, CoX-MoE delivers significant gains, achieving up to 7.1x and 2.4x higher throughput than FlexGen and MoE-Lightning, respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes CoX-MoE, an AMX-enabled CPU-GPU co-execution system for high-throughput MoE inference. It addresses memory pressure from large expert parameters by combining coalesced expert execution (using ordinary batches rather than micro-batches), selective attention offloading, and a static expert-aware stratification scheme that pre-assigns frequently activated experts to the GPU to reduce PCIe overhead. The central claim is that these optimizations jointly improve resource allocation and workload balance, delivering up to 7.1x higher throughput than FlexGen and 2.4x higher than MoE-Lightning.
Significance. If the throughput gains are robustly demonstrated and the assumptions about batching and offloading hold, the work would represent a practical advance in hybrid CPU-GPU MoE serving by better exploiting AMX for expert computation and mitigating the memory-bound and transfer bottlenecks of prior offloading approaches.
major comments (3)
- [Abstract] Abstract: the throughput claims (7.1x over FlexGen, 2.4x over MoE-Lightning) are presented without any experimental details, workload descriptions, hardware specifications, or error analysis, so it is impossible to determine whether the gains are supported by the data or affected by unstated choices in batching or offloading.
- [Coalescing-aware orchestration policy] Coalescing-aware orchestration policy: the load-bearing assumption that ordinary (non-micro) batch sizes for expert computation will raise operational intensity enough to escape the memory-bound regime is not supported by roofline analysis or measurements; in autoregressive decode the effective batch size is typically 1 and KV-cache access dominates, so the claim that this choice avoids memory-bound behavior or PCIe penalties requires explicit verification.
- [Selective attention offloading] Selective attention offloading: the assertion that selective attention offloading to CPU remains low-overhead and correct during token-by-token decode lacks supporting evidence; PCIe transfers risk latency spikes and numerical drift without precise synchronization of activations, and the paper does not quantify these effects or demonstrate that they do not negate the reported gains.
minor comments (1)
- The description of the static expert-aware stratification scheme would benefit from explicit pseudocode or a diagram showing how activation frequency thresholds are used to pre-assign experts.
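As an illustration of the kind of pseudocode this comment asks for, here is a minimal sketch of a frequency-based pre-assignment. It assumes a fixed number of GPU expert slots and ranks experts by activation counts from an offline profiling run; the paper's actual rule (for example, a frequency threshold) is not given on this page, so the function names, inputs, and tie-breaking are hypothetical.

```python
def stratify_experts(activation_counts, gpu_slots):
    """Split experts into GPU-resident and CPU-resident sets by frequency.

    activation_counts: dict mapping expert id -> activations observed in an
    offline profiling run (hypothetical input).
    gpu_slots: number of experts that fit in the GPU memory budget.
    """
    # Most frequently activated first; ties broken by expert id for determinism.
    ranked = sorted(activation_counts, key=lambda e: (-activation_counts[e], e))
    gpu = set(ranked[:gpu_slots])
    cpu = set(ranked[gpu_slots:])
    return gpu, cpu

def gpu_hit_rate(activation_counts, gpu_experts):
    """Fraction of profiled activations served by GPU-resident experts."""
    total = sum(activation_counts.values())
    hits = sum(activation_counts[e] for e in gpu_experts)
    return hits / total if total else 0.0

# Hypothetical profile for one MoE layer with 4 experts.
counts = {0: 500, 1: 120, 2: 300, 3: 80}
gpu, cpu = stratify_experts(counts, gpu_slots=2)
print("GPU experts:", sorted(gpu), "CPU experts:", sorted(cpu))
print("GPU hit rate:", gpu_hit_rate(counts, gpu))
```

The hit rate is a rough proxy for how many activations avoid a PCIe weight transfer or CPU execution; a static split like this only helps if the profiled distribution resembles the serving workload.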
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating planned revisions to strengthen the presentation of our results and methodology.
- Referee: [Abstract] Abstract: the throughput claims (7.1x over FlexGen, 2.4x over MoE-Lightning) are presented without any experimental details, workload descriptions, hardware specifications, or error analysis, so it is impossible to determine whether the gains are supported by the data or affected by unstated choices in batching or offloading.
  Authors: We acknowledge that the abstract presents the key throughput claims concisely without embedding full experimental details, as is conventional for abstracts due to length constraints. The full manuscript (Sections 4 and 5) provides the complete experimental setup, including model architectures and workloads, hardware specifications (AMX-enabled CPU and specific GPU), batching parameters, and results with multiple runs and variability measures. To address the referee's concern and improve standalone readability, we will revise the abstract to include a brief reference to the primary experimental conditions and hardware platform.
  Revision: yes
- Referee: [Coalescing-aware orchestration policy] Coalescing-aware orchestration policy: the load-bearing assumption that ordinary (non-micro) batch sizes for expert computation will raise operational intensity enough to escape the memory-bound regime is not supported by roofline analysis or measurements; in autoregressive decode the effective batch size is typically 1 and KV-cache access dominates, so the claim that this choice avoids memory-bound behavior or PCIe penalties requires explicit verification.
  Authors: We appreciate the referee's emphasis on verifying the operational intensity benefits in the decode stage. While per-token processing in autoregressive generation starts with a batch size of 1, our coalesced execution aggregates expert computations across tokens from multiple concurrent requests and sequences, which measurably increases arithmetic intensity compared to micro-batching. The manuscript reports end-to-end throughput improvements under these conditions, but we agree that explicit roofline analysis would strengthen the claim. In the revised version, we will add roofline plots and operational intensity measurements for both micro-batching and ordinary batching during decode, explicitly addressing KV-cache effects and confirming the shift away from the memory-bound regime.
  Revision: yes
- Referee: [Selective attention offloading] Selective attention offloading: the assertion that selective attention offloading to CPU remains low-overhead and correct during token-by-token decode lacks supporting evidence; PCIe transfers risk latency spikes and numerical drift without precise synchronization of activations, and the paper does not quantify these effects or demonstrate that they do not negate the reported gains.
  Authors: We thank the referee for highlighting the importance of quantifying overheads and correctness for selective attention offloading in the decode phase. Our orchestration policy incorporates synchronization barriers and selective transfer of only necessary activations to maintain numerical fidelity and control latency. The reported throughput gains already reflect the net effect after any transfer costs, but we concur that dedicated quantification is needed. We will revise the manuscript to include explicit measurements of PCIe transfer times, per-token latency distributions, and numerical accuracy comparisons (e.g., output equivalence to full-GPU baselines) to demonstrate that these factors do not offset the overall performance benefits.
  Revision: yes
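The output-equivalence comparison promised in the last response could take a form like the sketch below. It assumes per-step logits from a full-GPU run and from a co-execution run of the same prompt are available as plain lists of floats; the function name, the tolerance, and the sample values are hypothetical and not drawn from the paper.

```python
def compare_outputs(reference_logits, candidate_logits, atol=1e-2):
    """Compare per-step logits from two runs of the same prompt.

    Returns a tuple: (maximum absolute difference, fraction of decode steps
    whose argmax token matches, whether every element is within `atol`).
    """
    if len(reference_logits) != len(candidate_logits):
        raise ValueError("runs have different numbers of decode steps")
    max_diff = 0.0
    matches = 0
    for ref, cand in zip(reference_logits, candidate_logits):
        if len(ref) != len(cand):
            raise ValueError("vocabulary sizes differ between runs")
        step_diff = max(abs(r - c) for r, c in zip(ref, cand))
        max_diff = max(max_diff, step_diff)
        # Greedy-decoding agreement: same highest-scoring token at this step.
        if ref.index(max(ref)) == cand.index(max(cand)):
            matches += 1
    agreement = matches / len(reference_logits) if reference_logits else 1.0
    return max_diff, agreement, max_diff <= atol

# Hypothetical logits for two decode steps over a 3-token vocabulary.
full_gpu = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3]]
co_exec = [[0.1, 1.995, -1.0], [1.5, 0.2, 0.305]]
max_diff, agreement, within_tol = compare_outputs(full_gpu, co_exec)
print(max_diff, agreement, within_tol)
```

Small element-wise differences are expected when reduced-precision arithmetic runs on different hardware, so token-level agreement is often the more informative of the two numbers.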
Circularity Check
No circularity: empirical system claims rest on external benchmarks
The paper describes a systems implementation (CoX-MoE) that combines coalesced expert execution, ordinary batch sizes for experts, selective attention offloading, and static expert stratification on AMX-enabled CPU-GPU hardware. Throughput gains are asserted via direct comparison to external baselines (FlexGen, MoE-Lightning) rather than any internal derivation, equation, fitted parameter, or prediction that reduces to the paper's own inputs. No self-definitional constructs, uniqueness theorems, ansatz smuggling, or renaming of known results appear; the load-bearing steps are engineering choices whose correctness is evaluated against independent measurements on the same hardware.
Axiom & Free-Parameter Ledger
free parameters (1)
- expert activation frequency threshold
axioms (1)
- domain assumption: AMX instructions are present and deliver meaningful acceleration for the matrix operations arising in MoE expert layers.
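The first half of this assumption, that AMX is present, can be checked on a Linux host by looking for the `amx_tile`, `amx_bf16`, and `amx_int8` feature flags in `/proc/cpuinfo`. The sketch below parses that text; the function names are illustrative, and it says nothing about the second half of the assumption, namely how much acceleration AMX delivers for expert layers.

```python
def amx_features(cpuinfo_text):
    """Return the set of AMX feature flags found in /proc/cpuinfo text.

    Only the first "flags" line is inspected, which describes the first
    logical CPU; flags are normally identical across cores.
    """
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, rest = line.partition(":")
            return {flag for flag in rest.split() if flag.startswith("amx_")}
    return set()

def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the cpuinfo file, returning an empty string if it is unavailable."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        return ""
```

Calling `amx_features(read_cpuinfo())` on a machine without AMX, or on a non-Linux system where the file does not exist, returns an empty set.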
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "coalescing-aware orchestration policy ... adopting ordinary batch, instead of micro-batch, for expert computation and selective attention offloading"
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "static expert-aware stratification scheme that pre-assigns frequently activated experts to the GPU"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Solution Brief. [n. d.]. Accelerate Artificial Intelligence (AI) Workloads with Intel Advanced Matrix Extensions (Intel AMX). ([n. d.])
- [2] Shiyi Cao, Shu Liu, Tyler Griggs, Peter Schafhalter, Xiaoxuan Liu, Ying Sheng, Joseph E Gonzalez, Matei Zaharia, and Ion Stoica. 2025. Moe-lightning: High-throughput moe inference on memory-constrained gpus. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 715–730
- [3] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2024. A survey on evaluation of large language models. ACM transactions on intelligent systems and technology 15, 3 (2024), 1–45
- [4] Hongtao Chen, Weiyu Xie, Boxin Zhang, Jingqi Tang, Jiahao Wang, Jianwei Dong, Shaoyuan Chen, Ziwei Yuan, Chen Lin, Chengyu Qiu, et al. 2025. KTransformers: Unleashing the Full Potential of CPU/GPU Hybrid Inference for MoE Models. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 1014–1029
- [5]
- [6] Jack Choquette. 2023. Nvidia hopper h100 gpu: Scaling performance. IEEE Micro 43, 3 (2023), 9–17
- [7] Jack Choquette and Wish Gandhi. 2020. Nvidia a100 gpu: Performance & innovation for gpu computing. In 2020 IEEE Hot Chips 32 Symposium (HCS). IEEE Computer Society, 1–43
- [8] Seongmin Hong, Seungjae Moon, Junsoo Kim, Sungjae Lee, Minsub Kim, Dongsoo Lee, and Joo-Young Kim. 2022. Dfx: A low-latency multi-fpga appliance for accelerating transformer-based text generation. In 2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 616–630
- [9] Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, and Mao Yang. 2024. Pre-gated moe: An algorithm-system co-design for fast and scalable mixture-of-expert inference. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 1018–1031
- [10] Intel Corporation. 2025. Deep Learning with AVX512 and DL Boost. https://www.intel.com/content/www/us/en/developer/articles/guide/deep-learning-with-avx512-and-dl-boost.html. Accessed: 2025-11-18
- [11] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)
- [12]
- [13] Hyungyo Kim, Gaohan Ye, Nachuan Wang, Amir Yazdanbakhsh, and Nam Sung Kim. 2024. Exploiting intel advanced matrix extensions (AMX) for large language model inference. IEEE Computer Architecture Letters 23, 1 (2024), 117–120
- [14] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110 (2022)
- [16] Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. 2024. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model. arXiv preprint arXiv:2405.04434 (2024)
- [17]
- [18]
- [19] Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745 (2018)
- [20] Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Naveed Akhtar, Nick Barnes, and Ajmal Mian. 2025. A comprehensive overview of large language models. ACM Transactions on Intelligent Systems and Technology 16, 5 (2025), 1–72
- [21] NVIDIA. 2024. NVIDIA Nsight Systems. https://developer.nvidia.com/nsight-systems. Accessed: 2025-11-17
- [22] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 118–132
- [23] Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. 2023. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning. PMLR, 31094–31116
- [24]
- [25]
- [26] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
- [27] Shuzhang Zhong, Ling Liang, Yuan Wang, Runsheng Wang, Ru Huang, and Meng Li. 2024. AdapMoE: Adaptive sensitivity-based expert gating and management for efficient moe inference. In Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design. 1–9
- [28]
- [29] Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Quoc V Le, James Laudon, et al. 2022. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems 35 (2022), 7103–7114