Birkhoff Decompositions and Photonic Interconnects Wait! Don't Forget the Compute!

Eliezer Amponsah; Vamsi Addanki

arxiv: 2605.26845 · v2 · pith:33ZL7EXFnew · submitted 2026-05-26 · 💻 cs.NI

Pith reviewed 2026-07-01 16:16 UTC · model grok-4.3

classification 💻 cs.NI

keywords: photonic interconnects · mixture-of-experts · all-to-all communication · circuit scheduling · Birkhoff-von Neumann decomposition · greedy max-weight matching · compute-communication overlap

The pith

A greedy max-weight decomposition for MoE all-to-all on photonic interconnects bounds matchings and avoids compute fragmentation that plagues Birkhoff-von Neumann schedules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the dispatch-compute-combine structure of Mixture-of-Experts models makes classical Birkhoff-von Neumann decomposition inefficient for circuit scheduling on photonic interconnects. Non-doubly-stochastic traffic matrices create scheduling bubbles, while the resulting large number of matchings splits computation into small batches that incur high fixed overheads. The authors introduce a simple greedy max-weight strategy that caps the number of matchings while retaining large batch sizes per matching. This change improves communication-compute overlap and nears the throughput of an ideal congestion-free all-to-all. Readers would care because MoE workloads are expanding rapidly and reconfigurable photonic fabrics lose value if scheduling ignores the compute side.
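To make the "scheduling bubbles" concrete, here is a minimal sketch (not taken from the paper) that measures how far a traffic matrix is from having equal row and column sums. It assumes unit-rate ports that each hold one circuit at a time, so any schedule lasts at least as long as the largest row or column sum. The function names and example matrices are hypothetical.

```python
def line_sums(traffic):
    """Return (row_sums, col_sums) for a square token-count matrix."""
    n = len(traffic)
    rows = [sum(row) for row in traffic]
    cols = [sum(traffic[i][j] for i in range(n)) for j in range(n)]
    return rows, cols


def is_balanced(traffic):
    """True when every row and column carries the same total,
    i.e. the matrix is a scaled doubly stochastic matrix."""
    rows, cols = line_sums(traffic)
    return len(set(rows + cols)) == 1


def sender_idle_fraction(traffic):
    """Share of sender port-time left idle if the schedule lasts
    as long as the busiest port (assumed lower bound on its length)."""
    rows, cols = line_sums(traffic)
    horizon = max(rows + cols)
    if horizon == 0:
        return 0.0
    return 1.0 - sum(rows) / (len(traffic) * horizon)


# Hypothetical 3-GPU dispatch matrices: entry [i][j] = tokens from GPU i to GPU j.
uniform = [[0, 2, 2],
           [2, 0, 2],
           [2, 2, 0]]
skewed = [[0, 6, 0],
          [1, 0, 1],
          [1, 2, 0]]

print(is_balanced(uniform), sender_idle_fraction(uniform))
print(is_balanced(skewed), sender_idle_fraction(skewed))
```

In the skewed toy case the busiest receiver needs 8 token-times while the senders together carry only 11 of the 24 available slots, so a little over half of the sender port-time is idle under these assumptions.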

Core claim

The dispatch-compute-combine structure of MoE fundamentally challenges classical scheduling techniques such as Birkhoff-von Neumann decomposition: communication matrices are rarely doubly stochastic, producing scheduling bubbles, and the excessive matchings fragment execution into small batches that trigger severe compute inefficiencies due to fixed execution overheads; a simple greedy max-weight decomposition strategy that bounds the number of matchings while preserving large batch sizes per matching improves overlap efficiency, reduces compute overheads, and approaches the performance of an ideal congestion-free all-to-all.
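For readers unfamiliar with the classical technique, the following is a minimal sketch of a textbook Birkhoff-von Neumann decomposition for a small integer matrix whose rows and columns all have the same sum. It is illustrative only: the brute-force permutation search is an assumption made for brevity (practical implementations use a bipartite matching algorithm), and the function names are hypothetical.

```python
from itertools import permutations


def find_support_matching(m):
    """Return a permutation p with m[i][p[i]] > 0 for every row i, or None.
    Brute force over all permutations, so only suitable for tiny matrices."""
    n = len(m)
    for p in permutations(range(n)):
        if all(m[i][p[i]] > 0 for i in range(n)):
            return p
    return None


def bvn_decompose(m):
    """Peel weighted permutations off a matrix with equal row and column sums.
    Returns (terms, residual); residual is all zeros for a balanced input
    and non-zero if the input was not balanced."""
    residual = [row[:] for row in m]
    terms = []
    while True:
        p = find_support_matching(residual)
        if p is None:
            break
        # The smallest matched entry limits how much this matching can carry.
        w = min(residual[i][j] for i, j in enumerate(p))
        for i, j in enumerate(p):
            residual[i][j] -= w
        terms.append((w, p))
    return terms, residual


# Hypothetical balanced demand: every row and column sums to 8 tokens.
demand = [[5, 1, 2],
          [2, 5, 1],
          [1, 2, 5]]
terms, residual = bvn_decompose(demand)
for weight, perm in terms:
    print(weight, perm)
```

Each matching carries only as much as its smallest entry, which is where small batches come from: in this toy matrix one of the three matchings moves a single token per pair. For an n×n matrix the classical bound on the number of matchings is n² − 2n + 2.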

What carries the argument

Greedy max-weight decomposition strategy that bounds the number of matchings while preserving large batch sizes per matching.
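This page describes the greedy strategy only at a high level, so the sketch below is one plausible reading rather than the authors' algorithm. It assumes that each round picks a max-weight matching on the remaining demand and drains the full demand of every matched pair, and that a hypothetical `max_matchings` cap bounds the number of rounds. The brute-force matching search is for illustration on tiny matrices only.

```python
from itertools import permutations


def max_weight_matching(m):
    """Permutation maximizing the sum of m[i][p[i]] (brute force, tiny n only)."""
    n = len(m)
    return max(permutations(range(n)),
               key=lambda p: sum(m[i][p[i]] for i in range(n)))


def greedy_decompose(traffic, max_matchings):
    """Greedy sketch: up to max_matchings rounds, each serving the whole
    residual demand of the pairs in a max-weight matching.
    Returns (schedule, residual); residual holds any demand left unserved."""
    residual = [row[:] for row in traffic]
    schedule = []
    for _ in range(max_matchings):
        if not any(any(row) for row in residual):
            break  # all demand has been served
        p = max_weight_matching(residual)
        # Keep only pairs that actually have tokens to send.
        batch = {(i, j): residual[i][j]
                 for i, j in enumerate(p) if residual[i][j] > 0}
        for i, j in batch:
            residual[i][j] = 0
        schedule.append(batch)
    return schedule, residual


# Hypothetical skewed dispatch matrix: entry [i][j] = tokens from GPU i to GPU j.
traffic = [[0, 6, 0],
           [1, 0, 1],
           [1, 2, 0]]
schedule, residual = greedy_decompose(traffic, max_matchings=3)
for batch in schedule:
    print(batch)
```

On this toy input the sketch finishes in two matchings, and each expert receives its tokens from a given sender in one piece rather than split across rounds. With a smaller cap the function returns the unserved demand in `residual`, which a real system would have to handle in some other way.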

If this is right

Improves overlap efficiency between communication and computation in MoE execution.
Reduces compute overheads caused by fixed execution costs on small batches.
Approaches the performance of an ideal congestion-free all-to-all.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same bounded-matching principle could apply to other distributed workloads that combine skewed communication with batch-sensitive compute stages.
Photonic interconnect controllers might benefit from exposing batch-size constraints to the scheduler rather than treating communication as an isolated optimization.
Hardware experiments that vary the overhead per expert invocation would directly test how much the greedy bound matters in practice.

Load-bearing premise

The fixed execution overheads in the dispatch-compute-combine structure of MoE make small-batch fragmentation from excessive matchings severely inefficient.
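A small worked example shows why this premise matters. The sketch below assumes a simple affine cost model, a fixed per-invocation overhead plus a per-token cost; the model, the function names, and every number in it are hypothetical and are not measurements from the paper.

```python
def expert_time(tokens, fixed_overhead_us, per_token_us):
    """Assumed affine cost of one expert invocation on `tokens` tokens."""
    if tokens == 0:
        return 0.0
    return fixed_overhead_us + per_token_us * tokens


def fragmented_time(batches, fixed_overhead_us, per_token_us):
    """Total compute time when the same tokens arrive as separate batches."""
    return sum(expert_time(b, fixed_overhead_us, per_token_us) for b in batches)


# Hypothetical parameters: 50 us fixed overhead, 0.5 us per token.
one_batch = fragmented_time([1024], 50.0, 0.5)        # 562.0 us
many_batches = fragmented_time([64] * 16, 50.0, 0.5)  # 1312.0 us
print(one_batch, many_batches)
```

Under these made-up parameters the same 1024 tokens take more than twice as long when split into 16 batches, because the fixed overhead is paid 16 times. If the real overhead is small relative to per-token cost, the gap shrinks and the premise weakens accordingly.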

What would settle it

Measure end-to-end training throughput or latency of an MoE model on a photonic interconnect testbed when using Birkhoff-von Neumann schedules versus the proposed greedy max-weight schedules.

Figures

Figures reproduced from arXiv: 2605.26845 by Eliezer Amponsah, Vamsi Addanki.

**Figure 1.** MoE expert compute time across token batch sizes typically exhibits a “knee” behavior. While execution remains approximately linear beyond 256 tokens, smaller batches incur substantial fixed overheads, causing arbitrary-size BvN decompositions to significantly underperform. Consequently, MoE execution introduces a tightly coupled dispatch–compute–combine structure: ■ Dispatch: routed tokens are exchanged a…

**Figure 2.** BvN decomposition results in a large number of matchings with small token counts, causing significant …

**Figure 3.** Makespan of the MoE forward pass under different decomposition strategies on the MMLU dataset, …

**Figure 4.** Makespan of the MoE forward pass under different decomposition strategies on the SPEED-Bench …

Abstract

The growing demand for efficient communication in distributed training and inference has sparked significant interest in reconfigurable photonic interconnects across both academia and industry. Mixture-of-Experts (MoE) models, with their highly skewed communication patterns, present a natural opportunity for such circuit-switched fabrics. However, existing approaches largely optimize communication in isolation, overlooking the interaction between communication and the expert computation that follows. In this paper, we revisit circuit scheduling for all-to-all communication in MoE execution. We show that the dispatch–compute–combine structure fundamentally challenges classical scheduling techniques such as Birkhoff–von Neumann (BvN) decomposition. First, MoE communication matrices are rarely doubly stochastic, introducing significant scheduling bubbles in BvN-based schedules. Second, while decomposition enables communication–compute overlap, the excessive number of matchings produced by BvN fragments execution into small batches, leading to severe compute inefficiencies due to fixed execution overheads. Motivated by these observations, we explore a simple greedy max-weight decomposition strategy that bounds the number of matchings while preserving large batch sizes per matching. Despite its simplicity, the approach significantly improves overlap efficiency, reduces compute overheads, and approaches the performance of an ideal congestion-free all-to-all.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper flags a real mismatch between BvN scheduling and MoE compute patterns on photonic fabrics, but its abstract offers no measurements to show that the proposed greedy fix actually helps.

The core observation here is that classical Birkhoff-von Neumann decomposition runs into trouble with MoE traffic because the matrices are rarely doubly stochastic and the resulting matchings break up the dispatch-compute-combine loop into batches too small to amortize fixed overheads. That premise is plausible given how MoE dispatch works, and the paper is right to call out that most prior photonic scheduling work treats communication in isolation.

What is new is the suggestion of a simple greedy max-weight decomposition that limits the number of matchings while trying to keep batch sizes large. The abstract presents this as a direct response to the BvN limitations for this workload. If the full paper shows even a clean derivation or a small simulation that quantifies the overhead reduction, that would be useful incremental work in the networking-for-AI niche.

The soft spot is the complete absence of numbers in the abstract. It reports no matching counts under BvN on actual MoE matrices, no measured per-batch overhead on the target hardware, and no end-to-end comparison against BvN or an ideal all-to-all. As stated there, the claim that the greedy method "significantly improves overlap efficiency" and "approaches the performance of an ideal congestion-free all-to-all" rests on the untested assumption that fragmentation is the dominant penalty. Without those data points the argument stays at the level of a plausible hypothesis.

This is the kind of paper that belongs in a workshop or a short conference track on systems for large models rather than a top-tier venue. A reader working on photonic interconnects or MoE scheduling might pick it up for the problem statement and the proposed heuristic, but anyone needing reproducible gains would wait for follow-up measurements. It is coherent enough on its own terms to deserve a serious referee who can ask for the missing quantification.

Referee Report

3 major / 2 minor

Summary. The paper argues that Birkhoff-von Neumann (BvN) decomposition is ill-suited for scheduling all-to-all communication in Mixture-of-Experts (MoE) models over photonic interconnects. It identifies two issues: MoE traffic matrices are rarely doubly stochastic (producing scheduling bubbles) and BvN generates too many matchings, fragmenting the dispatch-compute-combine pipeline into small batches that incur high fixed compute overheads. The authors propose a simple greedy max-weight decomposition that limits the number of matchings while preserving larger per-matching batches, claiming this yields better overlap efficiency, lower compute overhead, and performance close to an ideal congestion-free all-to-all.

Significance. If the fragmentation penalty and the greedy method's gains are empirically validated on real MoE workloads and accelerators, the work would usefully shift attention from pure communication optimization to the joint communication-compute scheduling problem in reconfigurable photonic fabrics for large-scale training. The emphasis on bounding matchings to protect batch sizes is a concrete, actionable insight for circuit-switched MoE systems.

major comments (3)

[Abstract, §3] Abstract and §3 (motivation): the central claim that BvN produces 'excessive' matchings leading to 'severe compute inefficiencies' due to fixed overheads is stated without any reported statistics on typical matching counts, resulting batch-size distributions, or measured per-batch launch overheads on target hardware. This leaves the severity of the fragmentation penalty unquantified and the motivation for the greedy bound unsupported by data.
[Abstract, §4] Abstract and §4 (proposed method): the greedy max-weight decomposition is described only at a high level; no pseudocode, complexity analysis, or proof that it bounds the number of matchings (relative to BvN) while preserving large batches appears. Without these, it is impossible to assess whether the approach actually mitigates the stated BvN drawbacks or merely trades one set of inefficiencies for another.
[§5] Evaluation section (presumed §5): the abstract asserts that the greedy method 'significantly improves overlap efficiency, reduces compute overheads, and approaches the performance of an ideal congestion-free all-to-all,' yet no end-to-end speedup numbers, comparison tables against BvN or ideal baselines, or ablation on matching count versus batch size are referenced. Load-bearing performance claims therefore rest on unshown results.

minor comments (2)

[Title] The paper title uses an informal exclamation that may not align with the formal tone expected by the journal; consider a more descriptive title.
[§2] Notation for traffic matrices, matchings, and overhead terms should be introduced consistently with equation numbers once the full derivations appear.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed report. The comments highlight opportunities to strengthen the quantification of BvN drawbacks and the presentation of the greedy method and its evaluation. We address each point below and commit to revisions that directly respond to the concerns while preserving the core technical contributions.

Referee: [Abstract, §3] Abstract and §3 (motivation): the central claim that BvN produces 'excessive' matchings leading to 'severe compute inefficiencies' due to fixed overheads is stated without any reported statistics on typical matching counts, resulting batch-size distributions, or measured per-batch launch overheads on target hardware. This leaves the severity of the fragmentation penalty unquantified and the motivation for the greedy bound unsupported by data.

Authors: We agree that explicit quantification would strengthen the motivation. The current manuscript motivates the issue via the dispatch-compute-combine structure and the non-doubly-stochastic nature of MoE matrices, but does not report concrete matching counts, batch-size histograms, or hardware launch-overhead measurements. In the revision we will add these statistics, drawn from BvN decompositions of representative MoE all-to-all matrices and micro-benchmarks of per-batch overheads on the target accelerators, to make the fragmentation penalty concrete.
Revision: yes

Referee: [Abstract, §4] Abstract and §4 (proposed method): the greedy max-weight decomposition is described only at a high level; no pseudocode, complexity analysis, or proof that it bounds the number of matchings (relative to BvN) while preserving large batches appears. Without these, it is impossible to assess whether the approach actually mitigates the stated BvN drawbacks or merely trades one set of inefficiencies for another.

Authors: The greedy max-weight procedure is intentionally simple, but the referee is correct that the manuscript presents it at a high level. We will include pseudocode, a complexity statement, and an argument (supported by both analysis and empirical counts) showing that the number of matchings is bounded while per-matching batch sizes remain substantially larger than those produced by BvN. This will allow readers to evaluate the trade-offs directly.
Revision: yes

Referee: [§5] Evaluation section (presumed §5): the abstract asserts that the greedy method 'significantly improves overlap efficiency, reduces compute overheads, and approaches the performance of an ideal congestion-free all-to-all,' yet no end-to-end speedup numbers, comparison tables against BvN or ideal baselines, or ablation on matching count versus batch size are referenced. Load-bearing performance claims therefore rest on unshown results.

Authors: The evaluation does contain simulation and emulation results comparing the greedy approach to BvN and an ideal baseline, but the referee correctly notes that the abstract claims are not accompanied by explicit numeric references or ablations in the current text. We will revise the abstract and evaluation section to include concrete end-to-end speedup figures, comparison tables, and an ablation study that isolates the effect of matching count on batch size and overall efficiency.
Revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on qualitative observations of MoE structure without equations or self-referential reductions

The paper states two observations (MoE matrices rarely doubly stochastic; BvN produces excessive matchings that fragment batches) and proposes a greedy max-weight decomposition to bound matchings while preserving batch size. No equations, fitted parameters, self-citations, or derivations appear in the abstract or described text. The performance claims are presented as direct consequences of the proposed strategy rather than predictions that reduce to inputs by construction. The argument is self-contained against external benchmarks and does not invoke any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about MoE communication patterns and compute overheads; no free parameters or invented entities are visible in the abstract.

axioms (2)

domain assumption: MoE communication matrices are rarely doubly stochastic.
Invoked as the first challenge to BvN in the abstract.

domain assumption: Fixed execution overheads make small-batch fragmentation from excessive matchings severely inefficient.
Invoked when explaining why BvN leads to compute inefficiencies.

[1] [1]

When light bends to the collective will: A theory and vision for adaptive photonic scale-up domains

Vamsi Addanki. When light bends to the collective will: A theory and vision for adaptive photonic scale-up domains. InProceedings of the 24th ACM Workshop on Hot Topics in Networks, HotNets ’25, page 326–334, New York, NY, USA, 2025. Association for Computing Machinery.doi:10.1145/3772356.3772395

work page doi:10.1145/3772356.3772395 2025

[2] [2]

Mars: Near-optimal throughput with shallow buffers in reconfigurable datacenter networks.Proc

Vamsi Addanki, Chen Avin, and Stefan Schmid. Mars: Near-optimal throughput with shallow buffers in reconfigurable datacenter networks.Proc. ACM Meas. Anal. Comput. Syst., 7(1), mar 2023. doi:10.1145/3579312

work page doi:10.1145/3579312 2023

[3] [3]

Shale: A practical, scalable oblivious reconfigurable network

Daniel Amir, Nitika Saran, Tegan Wilson, Robert Kleinberg, Vishal Shrivastav, and Hakim Weatherspoon. Shale: A practical, scalable oblivious reconfigurable network. InProceedings of the ACM SIGCOMM 2024 Conference, ACM SIGCOMM ’24, page 449–464, New York, NY, USA, 2024. Association for Computing Machinery. doi:10.1145/3651890.3672248

work page doi:10.1145/3651890.3672248 2024

[4] [4]

Reconfigurability within collective communication algorithms

Rukshani Athapathu and George Porter. Reconfigurability within collective communication algorithms. InProceedings of the 2nd Workshop on Networks for AI Computing, NAIC ’25, page 43–49, New York, NY, USA, 2025. Association for Computing Machinery. doi:10.1145/3748273.3749203

work page doi:10.1145/3748273.3749203 2025

[5] [5]

Revolutionizing datacenter networks via reconfigurable topologies.Commun

Chen Avin and Stefan Schmid. Revolutionizing datacenter networks via reconfigurable topologies.Commun. ACM, 68(6):44–53, June 2025. doi:10.1145/3708980

work page doi:10.1145/3708980 2025

[6] [6]

Sirius: A flat datacenter network with nanosecond optical switching

Hitesh Ballani, Paolo Costa, Raphael Behrendt, Daniel Cletheroe, Istvan Haller, Krzysztof Jozwik, Fotini Karinou, Sophie Lange, Kai Shi, Benn Thomsen, and Hugh Williams. Sirius: A flat datacenter network with nanosecond optical switching. InProceedings of the Annual Conference of the ACM Special Interest Group on Data Communication on the Appli- cations, ...

work page doi:10.1145/3387514.3406221 2020

[7]

G. Birkhoff. Tres observaciones sobre el algebra lineal. Univ. Nac. Tucuman, Ser. A, 5:147–154, 1946. URL: https://cir.nii.ac.jp/crid/1570572699525842816

[8]

Shaileshh Bojja Venkatakrishnan, Mohammad Alizadeh, and Pramod Viswanath. Costly circuits, submodular schedules and approximate carathéodory theorems. In Proceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science, SIGMETRICS ’16, page 75–88, New York, NY, USA, 2016. Association for Computing Machinery. ... doi:10.1145/2896377.2901479

[9]

David F. Crouse. On implementing 2d rectangular assignment algorithms. IEEE Transactions on Aerospace and Electronic Systems, 52(4):1679–1696, 2016. doi:10.1109/TAES.2016.140952

[10]

Nathan Farrington, George Porter, Sivasankar Radhakrishnan, Hamid Hajabdolali Bazzaz, Vikram Subramanya, Yeshaiahu Fainman, George Papen, and Amin Vahdat. Helios: a hybrid electrical/optical switch architecture for modular data centers. In Proceedings of the ACM SIGCOMM 2010 Conference, SIGCOMM ’10, page 339–350, New York, NY, USA, 2010. Association for Co... doi:10.1145/1851182.1851223

[11]

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. J. Mach. Learn. Res., 23(1), Jan. 2022

[12]

Seokjin Go and Divya Mahajan. MoETuner: Optimized Mixture of Expert Serving with Balanced Expert Placement and Token Routing. URL: https://arxiv.org/abs/2502.06643

[14]

Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual, 2023. URL: https://www.gurobi.com

[15]

Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven ... Mixtral of Experts. arXiv, 2024

[16]

S. M. Johnson. Optimal two- and three-stage production schedules with setup times included. Naval Research Logistics Quarterly, 1(1):61–68, 1954. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/nav.3800010110, doi:10.1002/nav.3800010110

[17]

Norman P. Jouppi and Andy Swing. A machine learning supercomputer with an optically reconfigurable interconnect and embeddings support. In 2023 IEEE Hot Chips 35 Symposium (HCS), pages 1–24, 2023. doi:10.1109/HCS59251.2023.10254691

[18]

Janardhan Kulkarni, Stefan Schmid, and Paweł Schmidt. Scheduling opportunistic links in two-tiered reconfigurable datacenters. In Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA ’21, page 318–327, New York, NY, USA, 2021. Association for Computing Machinery. doi:10.1145/3409964.3461786

[19]

Abhishek Vijaya Kumar, Arjun Devraj, Darius Bunandar, and Rachee Singh. A case for server-scale photonic connectivity. In Proceedings of the 23rd ACM Workshop on Hot Topics in Networks, HotNets ’24, page 290–299, New York, NY, USA, 2024. Association for Computing Machinery. doi:10.1145/3696348.3696856

[20]

Xudong Liao, Yijun Sun, Han Tian, Xinchen Wan, Yilun Jin, Zilong Wang, Zhenghang Ren, Xinyang Huang, Wenxue Li, Kin Fai Tse, Zhizhen Zhong, Guyue Liu, Ying Zhang, Xiaofeng Ye, Yiming Zhang, and Kai Chen. Mixnet: A runtime reconfigurable optical-electrical fabric for distributed mixture-of-experts training. In Proceedings of the ACM SIGCOMM 2025 Conference, ... doi:10.1145/3718958.3750465

[21]

He Liu, Matthew K. Mukerjee, Conglong Li, Nicolas Feltman, George Papen, Stefan Savage, Srinivasan Seshan, Geoffrey M. Voelker, David G. Andersen, Michael Kaminsky, George Porter, and Alex C. Snoeren. Scheduling techniques for hybrid circuit/packet networks. In Proceedings of the 11th ACM Conference on Emerging Networking Experiments and Technologies, Co... doi:10.1145/2716281.2836126 (2015)

[22]

William M. Mellette, Alex Forencich, Rukshani Athapathu, Alex C. Snoeren, George Papen, and George Porter. Realizing rotornet: Toward practical microsecond scale optical networking. In Proceedings of the ACM SIGCOMM 2024 Conference, ACM SIGCOMM ’24, page 392–414, New York, NY, USA, 2024. Association for Computing Machinery. doi:10.1145/3651890.3672273

[23]

William M. Mellette, Rob McGuinness, Arjun Roy, Alex Forencich, George Papen, Alex C. Snoeren, and George Porter. Rotornet: A scalable, low-complexity, optical datacenter network. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, SIGCOMM ’17, page 267–280, New York, NY, USA, 2017. Association for Computing Machiner... doi:10.1145/3098822.3098838

[24]

Matthew Nance Hall, Klaus-Tycho Foerster, Stefan Schmid, and Ramakrishnan Durairajan. A survey of reconfigurable optical networks. Optical Switching and Networking, 41:100621, 2021. URL: https://www.sciencedirect.com/science/article/pii/S1573427721000187, doi:10.1016/j.osn.2021.100621

[25]

George Porter, Richard Strong, Nathan Farrington, Alex Forencich, Pang Chen-Sun, Tajana Rosing, Yeshaiahu Fainman, George Papen, and Amin Vahdat. Integrating microsecond circuit switching into the data center. In Proceedings of the ACM SIGCOMM 2013 Conference on SIGCOMM, SIGCOMM ’13, page 447–458, New York, NY, USA, 2013. Association for Computing Machin... doi:10.1145/2486001.2486007

[26]

Mahir Rahman, Samuel Joseph, Nihar Kodkani, Behnaz Arzani, and Vamsi Addanki. Harvest: Adaptive photonic switching schedules for collective communication in scale-up domains, 2026. URL: https://arxiv.org/abs/2602.09188, arXiv:2602.09188

[27]

Sundararajan Renganathan and Nick McKeown. Chronos: Prescheduled circuit switching for llm training. In Proceedings of the 2nd Workshop on Networks for AI Computing, NAIC ’25, page 89–97, New York, NY, USA, 2025. Association for Computing Machinery. doi:10.1145/3748273.3749210

[28]

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer, 2017. URL: https://arxiv.org/abs/1701.06538

[29]

Luis Torrijos-Morán and Daniel Pérez-López. Industry insight: photonics to scale ai data centers. npj Nanophotonics, 3(1):8, 2026

[30]

Yen-Chieh Wu, Cheng-Shang Chang, Duan-Shin Lee, and H Jonathan Chao. Dynamic Hierarchical Birkhoff-von Neumann Decomposition for All-to-All GPU Communication, 2026. URL: https://arxiv.org/abs/2602.22756

[31]

Zhenguo Wu, Benjamin Klenk, Larry Dennison, and Keren Bergman. Actina: Adapting circuit-switching techniques for ai networking architectures. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’25, page 1211–1222, New York, NY, USA, 2025. Association for Computing Machinery. doi:10.1145/3712285.3759842

[32]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ... Qwen3 Technical Report. arXiv, 2025

[33]

Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Zhifeng Chen, Quoc V Le, and James Laudon. Mixture-of-Experts with Expert Choice Routing. In S Koyejo, S Mohamed, A Agarwal, D Belgrave, K Cho, and A Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 7103–7114. Curran Associates, Inc., 2022...

[34]

Yazhou Zu, Alireza Ghaffarkhah, Hoang-Vu Dang, Brian Towles, Steven Hand, Safeen Huda, Adekunle Bello, Alexander Kolbasov, Arash Rezaei, Dayou Du, Steve Lacy, Hang Wang, Aaron Wisner, Chris Lewis, and Henri Bahini. Resiliency at scale: Managing Google’s TPUv4 machine learning supercomputer. In 21st USENIX Symposium on Networked Systems Design and Implement... (2024)