LiveStack: OS Support for Cluster-Scale Full-Stack Live Simulation
Pith reviewed 2026-06-26 19:11 UTC · model grok-4.3
The pith
LiveStack extends Linux virtualization with four subsystems to run unmodified production stacks in cluster-scale simulation while preserving both fidelity and iterative speed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LiveStack is an OS-level approach to cluster-scale full-stack simulation built on top of the Linux virtualization stack. LiveStack comprises four subsystems: simulation-oriented scheduling, live memory hierarchy management, simulation-aware IPC, and distributed simulation orchestration. Together, they coordinate live and modeled components under shared simulated time while controlling interference among co-located live hosts. These mechanisms point toward simulation-native OS support, where simulation control and orchestration become core OS responsibilities.
What carries the argument
Four coordinated subsystems (simulation-oriented scheduling, live memory hierarchy management, simulation-aware IPC, and distributed simulation orchestration) that keep live and modeled components synchronized under a single simulated timeline.
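The idea of a single simulated timeline can be pictured with a generic quantum-based lockstep loop. The abstract gives no implementation details, so this is only a sketch of one common conservative synchronization scheme and not LiveStack's actual mechanism. The names `Component`, `run_until`, and `advance_lockstep`, and the fixed quantum, are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    """Hypothetical stand-in for a live host or a modeled device."""
    name: str
    step_ns: int            # simulated time consumed by one unit of work
    local_time_ns: int = 0  # this component's view of simulated time

    def run_until(self, limit_ns: int) -> None:
        # Do work only while the next step stays within the granted limit.
        while self.local_time_ns + self.step_ns <= limit_ns:
            self.local_time_ns += self.step_ns


def advance_lockstep(components: List[Component], quantum_ns: int, end_ns: int) -> int:
    """Advance all components quantum by quantum under one shared clock."""
    global_time_ns = 0
    while global_time_ns < end_ns:
        limit_ns = min(global_time_ns + quantum_ns, end_ns)
        for component in components:
            component.run_until(limit_ns)
        # Barrier: no component may have run ahead of the shared clock.
        assert all(c.local_time_ns <= limit_ns for c in components)
        global_time_ns = limit_ns
    return global_time_ns
```

In this sketch a smaller quantum keeps components closer together in simulated time at the cost of more barriers, which is the usual fidelity versus speed trade-off in this style of synchronization.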
If this is right
- Unmodified production software stacks can be evaluated at cluster scale with full fidelity.
- Iterative exploration of hardware and software configurations becomes feasible at usable speeds.
- Interference among multiple live hosts sharing simulation resources remains controllable.
- Simulation orchestration can be treated as a native operating-system service rather than an external layer.
Where Pith is reading between the lines
- The same coordination pattern could be adapted to other virtualization or container runtimes beyond Linux.
- Shared simulated time might simplify debugging of timing-sensitive distributed bugs that are hard to reproduce on real clusters.
- Once simulation is inside the OS, hardware-in-the-loop experiments could be scheduled alongside ordinary workloads without separate toolchains.
Load-bearing premise
The Linux virtualization stack can be extended with the four described mechanisms without unacceptable interference or loss of fidelity when live and simulated components share the same resources.
What would settle it
A direct comparison in which an unmodified production distributed application running under LiveStack exhibits either fidelity loss relative to bare hardware or simulation throughput too low for iterative configuration sweeps would disprove the central claim.
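The settling test above amounts to two checks: one on fidelity and one on speed. A minimal sketch follows, assuming a single latency metric and caller-chosen thresholds. The function names, the choice of metric, and the threshold parameters are hypothetical and do not come from the paper.

```python
def relative_error(reference: float, measured: float) -> float:
    """Relative deviation of a simulated measurement from the bare-hardware one."""
    if reference == 0:
        raise ValueError("reference measurement must be non-zero")
    return abs(measured - reference) / abs(reference)


def claim_survives(
    bare_metal_latency_us: float,
    simulated_latency_us: float,
    sim_seconds_per_wall_second: float,
    max_error: float,
    min_speed: float,
) -> bool:
    """Return True only if both fidelity and speed meet the chosen thresholds."""
    fidelity_ok = relative_error(bare_metal_latency_us, simulated_latency_us) <= max_error
    speed_ok = sim_seconds_per_wall_second >= min_speed
    return fidelity_ok and speed_ok
```

Failing either check would count against the central claim, so the two conditions are combined with a logical AND rather than traded off against each other.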
Original abstract
Cluster-scale full-stack simulation is essential for evaluating distributed software stacks and emerging hardware components before deployment. Such simulation must achieve both full-stack fidelity for the unmodified production stack and the simulation performance required for iterative configuration exploration. However, no existing method achieves both. We present LiveStack, an OS-level approach to cluster-scale full-stack simulation built on top of the Linux virtualization stack. LiveStack comprises four subsystems: simulation-oriented scheduling, live memory hierarchy management, simulation-aware IPC, and distributed simulation orchestration. Together, they coordinate live and modeled components under shared simulated time while controlling interference among co-located live hosts. These mechanisms point toward simulation-native OS support, where simulation control and orchestration become core OS responsibilities.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents LiveStack, an OS-level approach to cluster-scale full-stack live simulation built on the Linux virtualization stack. It comprises four subsystems—simulation-oriented scheduling, live memory hierarchy management, simulation-aware IPC, and distributed simulation orchestration—that coordinate live and modeled components under shared simulated time while controlling interference among co-located live hosts, with the goal of achieving both full-stack fidelity for unmodified production stacks and the performance needed for iterative configuration exploration.
Significance. If the subsystems deliver the claimed fidelity and performance without unacceptable interference, the work would be significant for enabling pre-deployment evaluation of distributed software stacks and emerging hardware at cluster scale. It proposes making simulation control and orchestration core OS responsibilities, addressing a gap where existing methods fail to achieve both requirements simultaneously.
major comments (1)
- [Abstract] The central claim that the four subsystems achieve both full-stack fidelity and required simulation performance rests on the unverified assumption that the Linux virtualization stack can be extended with the described mechanisms without loss of fidelity or unacceptable interference; the manuscript provides no implementation details, evaluation data, or error analysis to substantiate this.
Simulated Author's Rebuttal
We thank the referee for the review and the identification of this issue with the abstract. We address the comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract] The central claim that the four subsystems achieve both full-stack fidelity and required simulation performance rests on the unverified assumption that the Linux virtualization stack can be extended with the described mechanisms without loss of fidelity or unacceptable interference; the manuscript provides no implementation details, evaluation data, or error analysis to substantiate this.
  Authors: We agree that the abstract states the central claim without sufficient qualification. The manuscript is a design paper whose contribution is the description of the four subsystems and their coordination under shared simulated time. It contains no implementation, no performance measurements, and no error analysis. We will revise the abstract to state that LiveStack is a proposed OS architecture whose mechanisms are designed to achieve the stated goals, with the design arguments for fidelity and interference control presented in the body; we will also add an explicit statement that empirical validation remains future work. This change will remove the unsubstantiated claim from the abstract.
  Revision: yes
Circularity Check
No significant circularity
The paper presents a systems architecture for LiveStack, an OS-level approach built on the Linux virtualization stack, comprising four described subsystems (simulation-oriented scheduling, live memory hierarchy management, simulation-aware IPC, and distributed simulation orchestration) that coordinate live and modeled components. No equations, fitted parameters, predictions, or derivation chains appear in the provided text. Claims rest on the design of these mechanisms rather than any reduction to prior fitted quantities or self-citation chains. The architecture is self-contained as a proposal for new OS responsibilities, with no load-bearing step that equates outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the Linux virtualization stack provides a suitable base for adding simulation-oriented scheduling, memory management, IPC, and orchestration without breaking production fidelity.