arxiv: 2605.22936 · v1 · pith:6OLABPRSnew · submitted 2026-05-21 · 💻 cs.AR · cs.PF

ACALSim: A Scalable Parallel Simulation Framework for High-Performance System Design Space Exploration

Pith reviewed 2026-05-25 05:31 UTC · model grok-4.3

classification 💻 cs.AR cs.PF
keywords architectural simulation · parallel simulation framework · GPU simulator · design space exploration · thread management · event-driven simulation · HPCSim · A100 simulation

The pith

ACALSim supplies a pluggable thread-management architecture that lets simulator developers add custom scheduling, and it reports over 14x speedup over an equivalent SST implementation on large GPU workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ACALSim as a simulation framework built to handle the scale of modern GPUs and AI accelerators with hundreds or thousands of components. Its central addition is an architecture that exposes thread-management hooks so developers can write workload-specific schedulers, something prior frameworks do not provide. The framework also supplies event-driven execution with fast-forwarding, a shared-memory model for zero-copy data exchange, and a two-phase execution model that supports deterministic scaling. These features are shown in HPCSim, a GPU simulator for A100-class hardware, where ACALSim finishes full LLaMA transformer layers that an equivalent SST setup cannot complete.
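To make the fast-forwarding idea concrete, here is a minimal sketch that contrasts a cycle-by-cycle loop with an event-driven loop that jumps the clock to the next scheduled event. The function names, event labels, and cycle numbers are illustrative assumptions, not ACALSim's API or data from the paper.

```python
import heapq

def run_cycle_by_cycle(events, end_cycle):
    """Baseline: tick every cycle, even when nothing is scheduled."""
    pending = dict(events)  # cycle -> label (hypothetical events)
    ticks, fired = 0, []
    for cycle in range(end_cycle + 1):
        ticks += 1
        if cycle in pending:
            fired.append((cycle, pending[cycle]))
    return ticks, fired

def run_fast_forward(events):
    """Event-driven: jump the clock straight to the next scheduled event."""
    queue = list(events)
    heapq.heapify(queue)
    steps, fired = 0, []
    while queue:
        cycle, label = heapq.heappop(queue)  # clock skips all idle cycles
        steps += 1
        fired.append((cycle, label))
    return steps, fired

# Hypothetical schedule with a long idle gap between cycle 3 and cycle 1000.
events = [(3, "issue"), (1000, "dram_reply"), (1004, "writeback")]
ticks, fired_a = run_cycle_by_cycle(events, end_cycle=1004)
steps, fired_b = run_fast_forward(events)
```

Both loops fire the same events in the same order; the event-driven loop simply does work proportional to the number of events rather than the number of simulated cycles.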

Core claim

ACALSim is a scalable parallel simulation framework whose key innovation is a pluggable thread-management architecture that lets developers implement custom scheduling strategies tailored to specific simulation patterns. It adds event-driven execution with fast-forward to remove idle-cycle overhead, a shared-memory data model for zero-copy communication, and a two-phase parallel execution model for deterministic thread scaling. When HPCSim is built on top of it for A100-class architectures, the framework delivers over 14x speedup and 41 percent lower memory footprint versus an SST implementation that uses identical shared timing cores, while SST fails to finish 256-plus thread-block runs and

What carries the argument

The pluggable thread-management architecture that exposes hooks for developers to implement custom scheduling strategies tailored to their workloads.
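The sketch below suggests what a pluggable scheduling hook could look like in miniature. The class names (`ThreadManager`, `FifoManager`, `LongestFirstManager`), the `order` method, and the `cost` field are all hypothetical stand-ins chosen for illustration; the paper's actual interface is not described in the text above.

```python
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor

class ThreadManager(ABC):
    """Hypothetical hook: decides the order in which ready simulators run."""
    @abstractmethod
    def order(self, ready):
        ...

class FifoManager(ThreadManager):
    """Default policy: dispatch simulators in the order they became ready."""
    def order(self, ready):
        return list(ready)

class LongestFirstManager(ThreadManager):
    """Workload-specific policy: dispatch the most expensive simulators first."""
    def order(self, ready):
        return sorted(ready, key=lambda sim: sim["cost"], reverse=True)

def run_iteration(manager, ready, workers=4):
    """Run one iteration; the engine is unchanged, only the manager differs."""
    def step(sim):
        return sim["name"], sim["cost"] * 2  # stand-in for one simulator step
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in submission order, i.e. the manager's order.
        return list(pool.map(step, manager.order(ready)))

# Hypothetical simulators with an estimated per-step cost.
sims = [{"name": "sm0", "cost": 1}, {"name": "l2", "cost": 5}, {"name": "noc", "cost": 3}]
fifo = run_iteration(FifoManager(), sims)
longest = run_iteration(LongestFirstManager(), sims)
```

The point of the design is that the policy is swapped without touching `run_iteration`: both managers produce the same set of results, differing only in dispatch order.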

If this is right

  • Design-space exploration becomes feasible for full transformer layers on A100-class hardware that SST cannot simulate in practical time.
  • Simulator developers gain the ability to match thread scheduling to the dataflow patterns of specific accelerators without rewriting the core engine.
  • Hardware correlation studies can now be performed on larger models, with reported cycle-count accuracy of 0.72 to 1.22 times measured A100 values.
  • Memory footprint reductions of 41 percent allow single-node runs that would otherwise require multi-node MPI setups.
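One way to read the 0.72 to 1.22 figure is as a ratio of simulated to measured cycle counts per kernel. The short sketch below works through that arithmetic under that assumption; the kernel names and cycle counts are made up for illustration and are not results from the paper.

```python
def correlation_ratio(simulated_cycles, measured_cycles):
    """Ratio of simulated to measured cycles; 1.0 means a perfect match."""
    return simulated_cycles / measured_cycles

def within_reported_band(ratio, low=0.72, high=1.22):
    """Check a ratio against the band quoted in the abstract."""
    return low <= ratio <= high

# Hypothetical (simulated, measured) cycle counts, NOT data from the paper.
samples = {
    "gemm": (90_000, 100_000),
    "softmax": (55_000, 50_000),
    "layernorm": (30_000, 50_000),
}
ratios = {name: correlation_ratio(sim, hw) for name, (sim, hw) in samples.items()}
flags = {name: within_reported_band(r) for name, r in ratios.items()}
```

Under these invented numbers the first two kernels fall inside the band and the third falls outside it, which is the kind of per-kernel breakdown a reader would want alongside the headline range.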

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the pluggable interface is adopted, other simulation domains such as CPU or network modeling could add similar custom schedulers to improve their own intra-node scaling.
  • The zero-copy shared-memory model may reduce communication costs enough to make single-node execution competitive with multi-node approaches for mid-sized accelerator designs.
  • Workload-specific scheduling could expose new bottlenecks in timing models that were previously masked by fixed threading policies.

Load-bearing premise

The SST comparison uses identical shared timing cores so that the reported speedup and memory reduction come only from ACALSim's framework features and not from differences in workload partitioning or measurement.

What would settle it

A controlled re-run of the same LLaMA workloads in which every non-framework difference between the ACALSim and SST implementations is removed and the speedup disappears.

Figures

Figures reproduced from arXiv: 2605.22936 by Chia-Pao Chiang, Jen-Chien Chang, Wei-Fen Lin, Yen-Po Chen, Yu-Cheng Chang, Yu-Jie Wan, Yu-Yang Lee, Zi-Yi Tai.

Figure 1: ACALSim Framework Block Diagram

Figure 2: ACALSim Simulation Flow
    (2) Control Thread Bookkeeping: Once all simulators have paused after completing their tasks, the control thread handles essential bookkeeping and coordinates interactions between simulators, such as data sharing, communication, and advancing the global clock to the next cycle. This phase, referred to as Phase 2, runs exclusively on the control thread. Minimizing the duration of thi…

Figure 3: Thread Manager Performance Comparison
    a configurable grid of processing elements (PEs) interconnected by a mesh-based network-on-chip, connected to shared cache and memory. The test injected traffic from the master CPU to random PEs and simulated the response. To assess scalability, total traffic was fixed while increasing the number of PEs, distributing computation across more simulators. Although neither…

Figure 4: Hybrid ACALSim-SST architecture for multi-GPU simulation.

Figure 5: Multi-rank scalable deployment with 2 SST ranks coordinating 4 GPUs.

Figure 6: NVSim Simulation Suite Block Diagram

Figure 7: DGXSim: CPU Utilization and Performance Scaling

Figure 8: GPUSim: CPU Utilization and Performance Scaling

Figure 9: GPUSim Parallelism Degree Per Iteration
    leading to memory-intensive behavior [3]. Compared to DGXSim, GPUSim models a more detailed memory subsystem which incurs high-frequency event and packet simulations. It generates a memory-bound workload in a multi-threaded environment as shown in Figures 8 and 9. The CPU utilization remains below 250% even though, on average, approximately 30+ out of 159 simulators …

Figure 10: Simulation Profiling Results
    memory-intensive pattern. V6 (thread-local queues) eliminates lock contention, providing the best scaling at high thread counts (Figure 10d). Figures 7 and 8 isolate this effect by running identical workloads with different ThreadManager implementations. 5.2 Simulation-Pattern-Driven Optimization ThreadManagerV1 targets sparse activation patterns in co-simulation environments.…

Figure 11: TaskManager Scheduling Overhead Comparison
Original abstract

Architectural simulation has become the critical bottleneck limiting design space exploration for high-performance computing systems. Modern GPUs and AI accelerators -- with hundreds to thousands of tightly-coupled components -- demand simulation frameworks that deliver efficient parallelism and scalable single-node execution. Existing frameworks fall short: SST focuses on multi-node MPI scalability but struggles with intra-node scaling, while GPGPU-Sim remains largely single-threaded. Critically, none expose a mechanism for users to optimize threading for their specific workloads. We introduce ACALSim, a scalable parallel simulation framework providing infrastructure and APIs for building high-performance simulators -- timing-model accuracy remains the responsibility of simulator developers. Its key innovation is a pluggable thread-management architecture that lets developers implement custom scheduling strategies tailored to specific simulation patterns, absent in existing frameworks. Complementing it are (1) event-driven execution with fast-forward to eliminate idle-cycle overhead, (2) a shared-memory data model enabling zero-copy communication, and (3) a two-phase parallel execution model for deterministic thread scaling. We demonstrate ACALSim through HPCSim, a GPU simulator targeting A100-class architectures. Against an SST implementation using identical shared timing cores to isolate framework overhead, ACALSim achieves over 14x speedup with 41% lower memory footprint; hardware validation confirms 0.72--1.22x cycle-count correlation with A100 measurements. While SST fails to complete 256+ thread-block workloads within practical time limits, ACALSim simulates full LLaMA transformer layers (single block) in 17.7 minutes for LLaMA-7B and 30.4 minutes for LLaMA-13B -- enabling design space exploration that SST cannot achieve.
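The two-phase execution model in item (3) can be sketched roughly as follows, assuming a parallel phase in which simulators only read their own inputs and a serial phase in which a single control thread exchanges data and advances the clock. The ring-style data forwarding, the `run_two_phase` name, and the toy simulators are assumptions made for illustration, not the framework's real structure.

```python
from concurrent.futures import ThreadPoolExecutor

def run_two_phase(simulators, iterations, workers=4):
    names = list(simulators)
    inbox = {name: 0 for name in names}  # data visible to each simulator
    clock = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(iterations):
            # Phase 1: all simulators step in parallel; each reads only its
            # own inbox entry and the current clock, and writes nothing shared.
            def step(name, now=clock):
                return simulators[name](inbox[name], now)
            outputs = list(pool.map(step, names))
            # Phase 2: the control thread alone forwards each output to the
            # next simulator in a (hypothetical) ring and advances the clock.
            for i, value in enumerate(outputs):
                inbox[names[(i + 1) % len(names)]] = value
            clock += 1
    return clock, inbox

# Toy simulators: each maps (input value, current cycle) to an output value.
sims = {
    "sm0": lambda value, cycle: value + 1,
    "l2": lambda value, cycle: value * 2 + cycle,
    "noc": lambda value, cycle: value + cycle,
}
serial = run_two_phase(sims, iterations=5, workers=1)
parallel = run_two_phase(sims, iterations=5, workers=8)
```

Because all shared state changes happen in the serial phase, the final state is the same for one worker or eight, which is the sense of "deterministic thread scaling" this sketch tries to capture.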

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ACALSim, a parallel simulation framework for high-performance computing systems featuring a pluggable thread-management architecture, event-driven execution with fast-forward, a shared-memory data model for zero-copy communication, and a two-phase parallel execution model. Demonstrated through the HPCSim GPU simulator targeting A100-class architectures, it claims over 14x speedup and 41% lower memory footprint versus an SST implementation using identical shared timing cores, hardware cycle-count correlation of 0.72-1.22x, and the ability to complete LLaMA-7B/13B transformer layer simulations (17.7 and 30.4 minutes) where SST cannot handle 256+ thread-block workloads.

Significance. If the performance and correlation claims hold after detailed validation, ACALSim would address a critical scalability gap in intra-node architectural simulation for GPUs and AI accelerators, enabling design-space exploration of large workloads that current tools like SST cannot complete in practical time. The pluggable threading API is a concrete advance over existing frameworks.

major comments (2)
  1. [Abstract] Abstract (and implied Results section): The central claim of >14x speedup and 41% memory reduction rests on comparison to 'an SST implementation using identical shared timing cores to isolate framework overhead,' but the manuscript provides no description of the SST port construction, including event scheduling, fast-forward logic, shared-memory data model equivalence, thread-block partitioning, or measurement of completion (wall-clock vs. simulated cycles). This directly affects whether the reported gains can be attributed to the ACALSim framework.
  2. [Abstract] Abstract: Workload definitions, error bars, raw timing data, and hardware measurement methodology for the 0.72-1.22x cycle-count correlation and LLaMA layer timings are absent, which is load-bearing for assessing reproducibility and whether post-hoc choices affect the scalability claims versus SST.
minor comments (1)
  1. The abstract states concrete numerical claims without referencing a dedicated methods or experimental setup subsection; adding one would improve clarity even if the core claims are sound.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the manuscript requires additional methodological details to support the performance and correlation claims, and we will revise the paper to address these points.

  1. Referee: [Abstract] Abstract (and implied Results section): The central claim of >14x speedup and 41% memory reduction rests on comparison to 'an SST implementation using identical shared timing cores to isolate framework overhead,' but the manuscript provides no description of the SST port construction, including event scheduling, fast-forward logic, shared-memory data model equivalence, thread-block partitioning, or measurement of completion (wall-clock vs. simulated cycles). This directly affects whether the reported gains can be attributed to the ACALSim framework.

    Authors: We acknowledge that the current manuscript does not provide sufficient detail on the SST baseline construction. In the revised version we will add a dedicated subsection (likely in Section 5 or a new Appendix) that describes the SST port, including the event scheduler implementation, fast-forward logic, shared-memory data model, thread-block partitioning approach, and the exact measurement protocol (wall-clock time versus simulated cycles). This will make explicit how identical timing cores were used and allow readers to attribute the reported speedups and memory reductions to ACALSim's framework features. revision: yes

  2. Referee: [Abstract] Abstract: Workload definitions, error bars, raw timing data, and hardware measurement methodology for the 0.72-1.22x cycle-count correlation and LLaMA layer timings are absent, which is load-bearing for assessing reproducibility and whether post-hoc choices affect the scalability claims versus SST.

    Authors: We agree these elements are necessary for reproducibility. The revised manuscript will include: explicit workload definitions (thread-block counts, layer configurations, and input sizes for the LLaMA experiments), error bars on all timing and speedup figures, selected raw timing data (in an appendix or supplementary table), and a detailed account of the hardware measurement methodology used for A100 cycle-count correlation as well as the wall-clock timings for the LLaMA-7B/13B layers. These additions will allow independent assessment of the claims. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance claims only

The paper introduces ACALSim as a pluggable thread-management framework and reports direct empirical results (14x speedup, 41% memory reduction, LLaMA simulation times, 0.72-1.22x hardware correlation) against an SST baseline and A100 measurements. No equations, fitted parameters, predictions derived from prior fits, or self-citation chains appear in the provided text. The central claims rest on implementation-level timing comparisons that are externally falsifiable and do not reduce to any self-definitional or fitted-input structure. This is the expected non-finding for a systems paper whose value is in measured runtime behavior rather than any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a systems-engineering paper introducing a simulation framework; the abstract contains no mathematical free parameters, domain axioms, or newly postulated entities.

pith-pipeline@v0.9.0 · 5873 in / 1310 out tokens · 28639 ms · 2026-05-25T05:31:59.193219+00:00 · methodology

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages

  1. A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software. 163–174. doi:10.1109/ISPASS.2009.4919648

  2. J. Bang, Y. Choi, M. Kim, Y. Kim, and M. Rhu. 2024. vTrain: A Simulation Framework for Evaluating Cost-Effective and Compute-Optimal Large Language Model Training. In 2024 57th IEEE/ACM International Symposium on Microarchitecture (MICRO). 153–167. doi:10.1109/MICRO61859.2024.00021

  3. I. Böhm, B. Franke, and N. Topham. 2010. Cycle-accurate performance modelling in an ultra-fast just-in-time dynamic binary translation instruction set simulator. In 2010 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation. 1–10. doi:10.1109/ICSAMOS.2010.5642102

  4. Christopher D. Carothers, David Bauer, and Shawn Pearce. 2002. ROSS: A high-performance, low-memory, modular Time Warp system. J. Parallel and Distrib. Comput. 62, 11 (2002), 1648–1669. doi:10.1016/S0743-7315(02)00004-7

  5. V. Catania, A. Mineo, S. Monteleone, M. Palesi, and D. Patti. 2015. Noxim: An open, extensible and cycle-accurate network on chip simulator. In 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP). 162–163. doi:10.1109/ASAP.2015.7245728

  6. S. P. Chenna, M. Steyer, N. Kumar, M. Garzaran, and P. Thierry. 2024. Modeling and Simulation of Collective Algorithms on HPC Network Topologies using Structural Simulation Toolkit. In SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis. 909–916. doi:10.1109/SCW63240.2024.00129

  7. J. Cho, M. Kim, H. Choi, G. Heo, and J. Park. 2024. LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale. In 2024 IEEE International Symposium on Workload Characterization (IISWC). 15–29. doi:10.1109/IISWC63097.2024.00012

  8. J. Choquette. 2023. NVIDIA Hopper H100 GPU: Scaling Performance. IEEE Micro 43, 3 (2023), 9–17. doi:10.1109/MM.2023.3256796

  9. Firoozshahian et al. 2023. MTIA: First Generation Silicon Targeting Meta’s Recommendation Systems. In Proceedings of the 50th Annual International Symposium on Computer Architecture (ISCA ’23). ACM, Article 80, 13 pages. doi:10.1145/3579371.3589348

  10. Z. Guo et al. 2024. A Survey on Performance Modeling and Prediction for Distributed DNN Training. IEEE Transactions on Parallel and Distributed Systems 35, 12 (2024), 2463–2478. doi:10.1109/TPDS.2024.3476390

  11. H. Ham et al. 2024. ONNXim: A Fast, Cycle-Level Multi-Core NPU Simulator. IEEE Computer Architecture Letters 23, 2 (2024), 219–222. doi:10.1109/LCA.2024.3484648

  12. A. Ishii et al. 2018. NVSwitch and DGX-2: NVLink-Switching chip and scale-up compute server. In Hot Chips 30

  13. S. Karandikar et al. 2018. FireSim: FPGA-Accelerated Cycle-Exact Scale-Out System Simulation in the Public Cloud. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). 29–42. doi:10.1109/ISCA.2018.00014

  14. M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers. 2020. Accel-sim: An extensible simulation framework for validated GPU modeling. In Proc. ACM/IEEE 47th Annu. Int. Symp. Comput. Archit. 473–486

  15. H. Kwon, P. Chatarasi, V. Sarkar, T. Krishna, M. Pellauer, and A. Parashar. 2020. MAESTRO: A Data-Centric Approach to Understand Reuse, Performance, and Hardware Cost of DNN Mappings. IEEE Micro 40, 3 (2020), 20–29. doi:10.1109/MM.2020.2985963

  16. S. Lee, A. Phanishayee, and D. Mahajan. 2025. Forecasting GPU Performance for Deep Learning Training and Inference. In Proc. 30th ACM Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS ’25), Vol. 1. 493–508. doi:10.1145/3669940.3707265

  17. W. Luo et al. 2024. Benchmarking and Dissecting the Nvidia Hopper GPU Architecture. In 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS). 656–667. doi:10.1109/IPDPS57955.2024.00064

  18. P. S. Magnusson et al. 2002. Simics: A full system simulation platform. Computer 35, 2 (2002), 50–58. doi:10.1109/2.982916

  19. O. Matthews et al. 2020. MosaicSim: A Lightweight, Modular Simulator for Heterogeneous Systems. In 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 136–148. doi:10.1109/ISPASS48437.2020.00029

  20. L. Mei, P. Houshmand, V. Jain, S. Giraldo, and M. Verhelst. 2021. ZigZag: Enlarging Joint Architecture-Mapping Design Space Exploration for DNN Accelerators. IEEE Trans. Comput. 70, 8 (2021), 1160–1174. doi:10.1109/TC.2021.3059962

  21. Christian Menard, Jerónimo Castrillón, Matthias Jung, and Norbert Wehn. 2017. System simulation with gem5 and SystemC: The keystone for full interoperability. In 2017 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS). 62–69. doi:10.1109/SAMOS.2017.8344612

  22. NVIDIA. 2024. NVIDIA DGX H100/H200 User Guide. https://docs.nvidia.com/dgx/dgxh100-user-guide/dgxh100-user-guide.pdf

  23. Preeti Ranjan Panda. 2001. SystemC: a modeling platform supporting multiple design abstractions. In Proceedings of the 14th international symposium on Systems synthesis (ISSS ’01). ACM, 75–80. doi:10.1145/500001.500018

  24. A. Parashar et al. 2019. Timeloop: A Systematic Approach to DNN Accelerator Evaluation. In 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 304–315. doi:10.1109/ISPASS.2019.00042

  25. A. F. Rodrigues, K. S. Hemmert, B. W. Barrett, C. Kersey, R. Oldfield, M. Weston, R. Risen, J. Cook, P. Rosenfeld, E. Cooper-Balis, and B. Jacob. 2011. The structural simulation toolkit. SIGMETRICS Perform. Eval. Rev. 38, 4 (2011), 37–42. doi:10.1145/1964218.1964225

  26. A. Samajdar, J. M. Joseph, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna. 2020. A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim. In 2020 IEEE International Symposium on Performance Analysis of...

  27. Daniel Sanchez and Christos Kozyrakis. 2013. ZSim: fast and accurate microarchitectural simulation of thousand-core systems. In Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA ’13). ACM, 475–486. doi:10.1145/2485922.2485963

  28. O. Villa et al. 2021. Need for Speed: Experiences Building a Trustworthy System-Level GPU Simulator. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). 868–880. doi:10.1109/HPCA51647.2021.00077

  29. J. Wang et al. 2014. Manifold: A parallel simulation framework for multicore systems. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 106–115. doi:10.1109/ISPASS.2014.6844466

  30. W. Won, T. Heo, S. Rashidi, S. Sridharan, S. Srinivasan, and T. Krishna. 2023. ASTRA-sim2.0: Modeling Hierarchical Networks and Disaggregated Systems for Large-model Training at Scale. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). 283–294. doi:10.1109/ISPASS57527.2023.00035

  31. Tse-Chen Yeh, Zin-Yuan Lin, and Ming-Chao Chiang. 2011. Enabling TLM-2.0 interface on QEMU and SystemC-based virtual platform. In 2011 IEEE International Conference on IC Design & Technology. doi:10.1109/ICICDT.2011.5783207

Received January 2026; revised March 2026; accepted April 2026. Proc. ACM Meas. Anal. Comput. Syst., Vol. 10, No. 2, Article 28. Publication date: June 2026.