The Price of Anarchy in Disaggregated Inference
Pith reviewed 2026-06-27 04:49 UTC · model grok-4.3
The pith
Disaggregated inference turns GPU pools into competing agents whose selfish routing raises the price of anarchy once saturation hits, but an adaptive controller reduces it by 3.1 times.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Saturation induces regime transitions that shift the payoff structure of the resource, caching, and congestion games, causing the empirical PoA-hat to rise sharply; an adaptive routing controller that switches strategies at the saturation knee reduces PoA-hat 3.1 times on the 70B 1P/5D topology while incurring only a 13 percent throughput penalty.
What carries the argument
The empirical PoA-hat estimator, which measures the ratio of worst-case to optimal latency under selfish routing in the congestion game with positive KV-cache externalities.
If this is right
- Below saturation, selfish caching and routing produce bounded inefficiency.
- At saturation, the same behavior produces superlinear latency growth.
- Real-time detection of the knee allows switching routing policies without new instabilities.
- The same three-regime structure appears on both 340B and 70B models with the same transition point at C=128.
- Adaptive control improves the saturated operating point on both 1P/5D and 1P/2D topologies.
Where Pith is reading between the lines
- Similar game structures may appear in other disaggregated systems such as training or multi-tenant clusters.
- If saturation detection proves noisy, the controller could be replaced by a continuous optimization layer.
- The 13 percent throughput cost might be reduced by hybrid routing that blends both strategies.
- Extending the model to include energy or monetary costs could change the payoff matrices and PoA values.
Load-bearing premise
The empirical PoA-hat estimator accurately reflects the true price of anarchy and that saturation transitions can be detected in real time without introducing new instabilities.
What would settle it
Running the adaptive controller on the 70B 1P/5D topology and observing no drop in measured latency ratio or an increase in throughput loss beyond 13 percent would falsify the claim that the controller reliably improves the operating point.
Figures
read the original abstract
Disaggregated inference architectures physically separate prefill and decode phases onto distinct GPU pools, creating competing "agents" that share a fixed hardware budget. We provide, to our knowledge, the first formal game-theoretic analysis of this architecture, using NVIDIA Dynamo as a concrete case study. We model disaggregated serving as three coupled games: a two-player resource game between prefill and decode pools, a selfish caching game over the hierarchical KV cache, and a congestion game with positive externalities for request routing. We empirically validate the latter two; the P/D resource game is treated analytically (Section 9.2). We characterize how GPU saturation induces regime transitions that shift the game's payoff structure: below saturation, selfish behavior has bounded Price of Anarchy (PoA); at saturation, superlinear latency and cache externalities drive our empirical estimator PoA-hat (defined in Section 6.4) upward. Based on this analysis, we design an adaptive controller that detects saturation transitions in real time and adjusts routing parameters accordingly, shifting from cache-affinity exploitation to load-balanced congestion avoidance. We instantiate our framework on a 3-node NVIDIA B200 cluster running Dynamo with two models, Nemotron-4-340B (TP=8, full-node workers with cross-InfiniBand KV transfers) and Llama-3.1-70B (TP=4), and find the same three-regime PoA-hat structure with the same first post-knee grid point (C=128) on both models. Adaptive routing shifts each model to a better operating point. Our strongest result is on the 70B 1P/5D topology, where PoA-hat drops 3.1x (66.4 to 21.5) in the saturated phase at a 13% throughput cost. On the 70B 1P/2D, PoA-hat drops 2.2x and TTFT P99 drops 7.6x (see Section 8.5).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to deliver the first game-theoretic analysis of disaggregated inference (prefill/decode separation on distinct GPU pools) by modeling it as three coupled games: a two-player resource game, a selfish KV-cache game, and a routing congestion game. It identifies saturation-induced regime transitions that increase an empirical PoA estimator (PoA-hat, defined in §6.4), designs a real-time adaptive routing controller that switches strategies at the detected knee (C=128), and reports that the controller reduces PoA-hat by 3.1× (66.4→21.5) on the 70B 1P/5D topology at 13% throughput cost, with consistent three-regime structure observed on both Nemotron-4-340B and Llama-3.1-70B.
Significance. If the empirical PoA-hat estimator is shown to track the true game-theoretic Price of Anarchy and the saturation detector proves robust, the work supplies a principled basis for routing controllers in production disaggregated serving systems. The consistency of the C=128 knee across two models and the analytical treatment of the P/D resource game (§9.2) are strengths; the empirical validation of the caching and routing games adds concrete grounding.
major comments (2)
- [§6.4] §6.4: PoA-hat is defined as an empirical ratio computed from the same experimental runs that later demonstrate the controller gains; the paper supplies no independent theoretical computation of the true PoA (supremum over equilibria divided by OPT) nor cross-validation of the estimator against an external baseline. Consequently the headline 3.1× reduction reported in §8.5 on the 70B 1P/5D topology rests on an unverified proxy.
- [§8.5] §8.5 and §9.2: The adaptive controller relies on real-time saturation detection to switch routing parameters, yet the manuscript contains no sensitivity analysis of detector error rates, no examination of transient instabilities introduced by the regime switch, and no quantification of how approximation in the “optimal” routing baseline propagates into the measured PoA-hat ratio. These omissions directly affect the claim that the controller reliably moves the system to a better operating point.
minor comments (2)
- [Abstract] The abstract states that the P/D resource game is treated analytically in §9.2 while the other two games are only empirically validated; a brief statement of the analytical result (e.g., closed-form equilibrium or bound) would help readers assess the completeness of the game-theoretic contribution.
- Notation for the three games and the saturation threshold C is introduced without a consolidated table; a short nomenclature table would improve readability.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below, clarifying the empirical nature of our PoA-hat estimator while agreeing to strengthen the validation of the adaptive controller.
read point-by-point responses
-
Referee: [§6.4] §6.4: PoA-hat is defined as an empirical ratio computed from the same experimental runs that later demonstrate the controller gains; the paper supplies no independent theoretical computation of the true PoA (supremum over equilibria divided by OPT) nor cross-validation of the estimator against an external baseline. Consequently the headline 3.1× reduction reported in §8.5 on the 70B 1P/5D topology rests on an unverified proxy.
Authors: We agree that PoA-hat is computed from the same runs used to evaluate the controller and that no closed-form theoretical PoA is provided. Section 6.4 explicitly defines PoA-hat as an empirical ratio (worst observed performance over best observed) under the three modeled games, chosen because the continuous strategy spaces and coupled dynamics preclude exact computation of the supremum over all equilibria. The 3.1× reduction is therefore an empirical demonstration of the controller's effect on this practical proxy, supported by the identical three-regime structure and C=128 knee observed on two different models. We will revise §6.4 and the discussion in §8.5 to state explicitly that PoA-hat is an estimator, explain the intractability of the theoretical PoA, and note the absence of external cross-validation. revision: partial
-
Referee: [§8.5] §8.5 and §9.2: The adaptive controller relies on real-time saturation detection to switch routing parameters, yet the manuscript contains no sensitivity analysis of detector error rates, no examination of transient instabilities introduced by the regime switch, and no quantification of how approximation in the “optimal” routing baseline propagates into the measured PoA-hat ratio. These omissions directly affect the claim that the controller reliably moves the system to a better operating point.
Authors: The detector threshold (C=128) was selected from the observed knee that was consistent across both Nemotron-4-340B and Llama-3.1-70B. We acknowledge the lack of sensitivity, transient, and baseline-approximation analysis. We will add a new subsection to the revised §8.5 containing (i) Monte-Carlo evaluation of detector error rates under load variation, (ii) trace-driven simulation of latency transients at regime switches, and (iii) bounds on how the routing-baseline approximation affects the reported PoA-hat ratios. These additions will quantify the robustness of the claimed operating-point improvement. revision: yes
- Independent theoretical computation of the exact Price of Anarchy (supremum over equilibria divided by OPT) for the coupled continuous-state games, which remains intractable given the model complexity.
Circularity Check
No significant circularity; empirical PoA-hat estimator is measured outcome, not input to derivation.
full rationale
The paper derives a game-theoretic model of three coupled games from first principles, analytically treats the P/D resource allocation game, and designs an adaptive controller from the resulting regime analysis. PoA-hat is introduced as a post-hoc empirical estimator (Sec 6.4) whose values are computed from experimental runs after the controller is applied; the reported reductions (e.g., 66.4 to 21.5) are therefore measured effects rather than quantities that were fitted or defined into the model. No self-citations, uniqueness theorems, or ansatzes are invoked to close the central claim. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- saturation knee point C =
128
axioms (1)
- domain assumption Prefill and decode pools behave as selfish agents maximizing individual performance metrics
invented entities (1)
-
PoA-hat empirical estimator
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Gulavani, Ramachandran Ramjee, and Alexey Tumanov
Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee, and Alexey Tumanov. Vidur: A large-scale simulation framework for LLM inference. InProceedings of Machine Learning and Systems (MLSys), 2024
2024
-
[2]
Exact price of anarchy for polynomial congestion games.SIAM Journal on Computing, 40(5):1211–1233, 2011
Sebastian Aland, Dominic Dumrauf, Martin Gairing, Burkhard Monien, and Florian Schopp- mann. Exact price of anarchy for polynomial congestion games.SIAM Journal on Computing, 40(5):1211–1233, 2011
2011
-
[3]
The price of routing unsplittable flow
Baruch Awerbuch, Yossi Azar, and Amir Epstein. The price of routing unsplittable flow. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pages 57–66, 2005
2005
-
[4]
Settling the complexity of two-player Nash equilibrium
Xi Chen and Xiaotie Deng. Settling the complexity of two-player Nash equilibrium. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 261–272, 2006
2006
-
[5]
The price of anarchy of finite congestion games
George Christodoulou and Elias Koutsoupias. The price of anarchy of finite congestion games. InProceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pages 67–73, 2005
2005
-
[6]
Selfish caching in distributed systems: a game-theoretic analysis
Byung-Gon Chun, Kamalika Chaudhuri, Hoeteck Wee, Marco Barreno, Christos H Papadim- itriou, and John Kubiatowicz. Selfish caching in distributed systems: a game-theoretic analysis. InProceedings of the 23rd Annual ACM Symposium on Principles of Distributed Computing (PODC), pages 21–30, 2004
2004
-
[7]
When is selfish routing bad? The price of anarchy in light and heavy traffic.Operations Research, 68(2):411–434, 2020
Riccardo Colini-Baldeschi, Roberto Cominetti, Panayotis Mertikopoulos, and Marco Scarsini. When is selfish routing bad? The price of anarchy in light and heavy traffic.Operations Research, 68(2):411–434, 2020
2020
-
[8]
Approximation and convergence of large atomic congestion games.Mathematics of Operations Research, 48(2): 784–811, 2023
Roberto Cominetti, Marco Scarsini, Marc Schröder, and Nicolás E Stier-Moses. Approximation and convergence of large atomic congestion games.Mathematics of Operations Research, 48(2): 784–811, 2023. 35
2023
-
[9]
The complexity of computing a Nash equilibrium
Constantinos Daskalakis, Paul W Goldberg, and Christos H Papadimitriou. The complexity of computing a Nash equilibrium. InProceedings of the 38th Annual ACM Symposium on Theory of Computing (STOC), pages 71–78, 2006
2006
-
[10]
Finding social optima in congestion games with positive externalities
Bart de Keijzer and Guido Schäfer. Finding social optima in congestion games with positive externalities. InProceedings of the 20th European Symposium on Algorithms (ESA), volume 7501 ofLNCS, pages 395–406. Springer, 2012
2012
-
[11]
Generalized Nash equilibrium problems.4OR, 5(3): 173–210, 2007
Francisco Facchinei and Christian Kanzow. Generalized Nash equilibrium problems.4OR, 5(3): 173–210, 2007
2007
-
[12]
Fatemeh Fardno and S. Rasoul Etesami. A game-theoretic framework for distributed load balancing: Static and dynamic game models.arXiv preprint arXiv:2501.15324, 2025. To appear in Automatica, 2026
-
[13]
Incentives in dominant resource fair allocation under dynamic demands
Giannis Fikioris, Rachit Agarwal, and Éva Tardos. Incentives in dominant resource fair allocation under dynamic demands. InSymposium on Algorithmic Game Theory (SAGT), 2024
2024
-
[14]
The price of anarchy of strategic queuing systems.Journal of the ACM, 70(3):20:1–20:63, 2023
Jason Gaitonde and Éva Tardos. The price of anarchy of strategic queuing systems.Journal of the ACM, 70(3):20:1–20:63, 2023
2023
-
[15]
Dominant resource fairness: Fair allocation of multiple resource types
Ali Ghodsi, Matei Zaharia, Benjamin Hindman, Andy Konwinski, Scott Shenker, and Ion Stoica. Dominant resource fairness: Fair allocation of multiple resource types. InProceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI), pages 323–336, 2011
2011
-
[16]
The price of anarchy in the Markovian single server queue.IEEE Transactions on Automatic Control, 59(2):455–459, 2014
Gail Gilboa-Freedman, Refael Hassin, and Yoav Kerner. The price of anarchy in the Markovian single server queue.IEEE Transactions on Automatic Control, 59(2):455–459, 2014
2014
-
[17]
KV Pareto: Systems-level optimization of KV cache and model compression for long context inference
Sai Gokhale, Devleena Das, Rajeev Patwari, Ashish Sirasao, and Elliott Delaye. KV Pareto: Systems-level optimization of KV cache and model compression for long context inference. arXiv preprint arXiv:2512.01953, 2025
-
[18]
Aaron Grattafiori et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
The price of anarchy in an exponential multi-server
Moshe Haviv and Tim Roughgarden. The price of anarchy in an exponential multi-server. Operations Research Letters, 35(4):421–426, 2007
2007
-
[20]
Efficient memory management for large language model serving with PagedAttention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles (SOSP), pages 611–626, 2023
2023
-
[21]
Selfish caching games on directed graphs
Qian Ma, Edmund Yeh, and Jianwei Huang. Selfish caching games on directed graphs. IEEE/ACM Transactions on Networking, 29(2):709–722, 2021
2021
-
[22]
Themis: Fair and efficient GPU cluster scheduling
Kshiteej Mahajan, Arjun Balasubramanian, Arjun Singhvi, Shivaram Venkataraman, Aditya Akella, Amar Phanishayee, and Shuchi Chawla. Themis: Fair and efficient GPU cluster scheduling. InProceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pages 289–304, 2020. 36
2020
-
[23]
Congestion games with player-specific payoff functions.Games and Economic Behavior, 13(1):111–124, 1996
Igal Milchtaich. Congestion games with player-specific payoff functions.Games and Economic Behavior, 13(1):111–124, 1996
1996
-
[24]
Potential games.Games and Economic Behavior, 14(1): 124–143, 1996
Dov Monderer and Lloyd S Shapley. Potential games.Games and Economic Behavior, 14(1): 124–143, 1996
1996
-
[25]
Cambridge University Press, 2007
Noam Nisan, Tim Roughgarden, Éva Tardos, and Vijay V Vazirani.Algorithmic Game Theory. Cambridge University Press, 2007
2007
-
[26]
NVIDIA Dynamo: Disaggregated inference serving for LLMs.https: //developer.nvidia.com/dynamo, 2025
NVIDIA Corporation. NVIDIA Dynamo: Disaggregated inference serving for LLMs.https: //developer.nvidia.com/dynamo, 2025. Version 0.9.0, accessed June 2026
2025
-
[27]
arXiv preprint arXiv:2311.18677 , year=
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative LLM inference using phase splitting. In Proceedings of the 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024. arXiv:2311.18677
-
[28]
Comparison of Nash bargaining and myopic equilibrium for resources allocation in cloud computing
Giovanni Perin, Gianluca Fighera, and Leonardo Badia. Comparison of Nash bargaining and myopic equilibrium for resources allocation in cloud computing. InIEEE Global Communications Conference (GLOBECOM), pages 1–6. IEEE, 2019
2019
-
[29]
Existence and uniqueness of equilibrium points for concave N-person games
J Ben Rosen. Existence and uniqueness of equilibrium points for concave N-person games. Econometrica, 33(3):520–534, 1965
1965
-
[30]
A class of games possessing pure-strategy Nash equilibria.International Journal of Game Theory, 2(1):65–67, 1973
Robert W Rosenthal. A class of games possessing pure-strategy Nash equilibria.International Journal of Game Theory, 2(1):65–67, 1973
1973
-
[31]
Intrinsic robustness of the price of anarchy.Journal of the ACM, 62(5): 1–42, 2015
Tim Roughgarden. Intrinsic robustness of the price of anarchy.Journal of the ACM, 62(5): 1–42, 2015
2015
-
[32]
How bad is selfish routing?Journal of the ACM, 49(2): 236–259, 2002
Tim Roughgarden and Éva Tardos. How bad is selfish routing?Journal of the ACM, 49(2): 236–259, 2002
2002
-
[33]
FlexGen: High-throughput generative inference of large language models with a single GPU
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. FlexGen: High-throughput generative inference of large language models with a single GPU. InProceedings of the 40th International Conference on Machine Learning (ICML), 2023
2023
-
[34]
Prefill-decode aggregation or disaggregation? Unifying both for goodput-optimized LLM serving
Chao Wang, Pengfei Zuo, Zhangyu Chen, Yunkai Liang, Zhou Yu, and Ming-Chang Yang. Prefill-decode aggregation or disaggregation? Unifying both for goodput-optimized LLM serving. arXiv preprint arXiv:2508.01989, 2025
-
[35]
Some theoretical aspects of road traffic research.Proceedings of the Institution of Civil Engineers, 1(3):325–362, 1952
John Glen Wardrop. Some theoretical aspects of road traffic research.Proceedings of the Institution of Civil Engineers, 1(3):325–362, 1952
1952
-
[36]
Minrui Xu, Dusit Niyato, and Christopher G Brinton. Serving long-context LLMs at the mobile edge: Test-time reinforcement learning-based model caching and inference offloading.arXiv preprint arXiv:2501.14205, 2025
-
[37]
Tianhao Xu, Yiming Liu, et al. AIConfigurator: Lightning-fast configuration optimization for multi-framework LLM serving.arXiv preprint arXiv:2601.06288, 2026. 37
-
[38]
Gonzalez, Clark Barrett, and Ying Sheng
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. InAdvances in Neural Information Processing Systems (NeurIPS), 2024
2024
-
[39]
Shock- wave: Fair and efficient cluster scheduling for dynamic adaptation in machine learning
Pengfei Zheng, Rui Pan, Tarannum Khan, Shivaram Venkataraman, and Aditya Akella. Shock- wave: Fair and efficient cluster scheduling for dynamic adaptation in machine learning. In Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2023
2023
-
[40]
DistServe: Disaggregating Prefill and Decoding for Goodput- optimized Large Language Model Serving
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. InProceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 193–210, 2024. arXiv:2401.09670. 38
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.