AIReSim: A Discrete Event Simulator for Large-scale AI Cluster Reliability Modeling

Fred Lin; Karthik Pattabiraman; Mihir Patel

arxiv: 2603.07041 · v3 · submitted 2026-03-07 · 💻 cs.DC

Pith reviewed 2026-05-15 15:37 UTC · model grok-4.3

classification: cs.DC

keywords: discrete event simulation · AI cluster reliability · failure recovery · scheduling · capacity planning · large-scale computing · what-if analysis

The pith

AIReSim is a discrete event simulator for evaluating failure, recovery, scheduling, and repair design choices in large-scale AI clusters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AIReSim as a discrete event simulator to model failure, recovery, scheduling, and repair processes in clusters running large AI workloads. Failures are costly in these systems because they force full job restarts from checkpoints, creating many tunable mechanisms to reduce impact. The simulator lets designers test how different parameter choices affect overall end-to-end reliability and identify which ones matter most for improvement efforts. It also supports tuning parameters for specific tradeoffs and running what-if analyses, illustrated through a capacity planning case study for large AI clusters.

Core claim

The authors built AIReSim to evaluate the different design choices during the failure, recovery, scheduling and repair processes for a cluster running a large-scale AI workload. AIReSim allows the system designer to systematically evaluate the effects of the different knobs and parameters on the overall end-to-end reliability of the system. Further, AIReSim can be used to identify which knobs or parameters are important in order to prioritize the investment of effort in improving the system. AIReSim also allows tuning of the knobs for achieving different tradeoffs in the system, as well as to consider various what-if scenarios. We present a case study of applying AIReSim for capacity planning for large-scale clusters running AI workloads.

What carries the argument

Discrete event simulator that models the timing and interactions of failures, recoveries, scheduling decisions, and repairs.
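
To make the mechanism concrete, here is a minimal sketch of a discrete event loop for a single checkpointed training job. It is not AIReSim's implementation, which the text above does not specify. The function name `simulate_training`, its parameters, the exponential failure model, the zero-cost checkpoints, and the rule that failures cannot occur during recovery are all assumptions made for illustration.

```python
import heapq
import itertools
import random


def simulate_training(work_hours, mtbf_hours, ckpt_interval, recovery_hours, seed=0):
    """Return (wall_clock_hours, failure_count) for one simulated training job.

    Illustrative assumptions: exponential time to failure, free checkpoints,
    and no failures while the job is recovering.
    """
    rng = random.Random(seed)
    events = []                 # min-heap of (time, tie_breaker, kind, epoch)
    tie = itertools.count()
    epoch = 0                   # bumped on every failure to invalidate stale events
    saved = 0.0                 # hours of work captured by the last checkpoint
    failures = 0

    def schedule(time, kind):
        heapq.heappush(events, (time, next(tie), kind, epoch))

    def start_run(now):
        # Resume from the last checkpoint: the next milestone is either
        # another checkpoint or job completion.
        remaining = work_hours - saved
        if remaining <= ckpt_interval:
            schedule(now + remaining, "done")
        else:
            schedule(now + ckpt_interval, "checkpoint")

    def arm_failure(now):
        schedule(now + rng.expovariate(1.0 / mtbf_hours), "failure")

    start_run(0.0)
    arm_failure(0.0)
    while events:
        now, _, kind, event_epoch = heapq.heappop(events)
        if event_epoch != epoch:
            continue            # scheduled before a failure; no longer valid
        if kind == "done":
            return now, failures
        if kind == "checkpoint":
            saved += ckpt_interval
            start_run(now)
        elif kind == "failure":
            failures += 1
            epoch += 1          # drops the pending checkpoint/done event
            schedule(now + recovery_hours, "restart")
        elif kind == "restart":
            start_run(now)      # work since the last checkpoint is redone
            arm_failure(now)
    raise RuntimeError("event queue drained before the job finished")
```

Calling `simulate_training(100.0, 20.0, 10.0, 1.0, seed=1)` returns the wall-clock time and failure count for one random trajectory; averaging over many seeds would give an estimate of expected training time under these assumptions.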

Load-bearing premise

The discrete-event model accurately represents the timing and interactions of real failures, recoveries, and scheduling decisions in production AI clusters.

What would settle it

Running AIReSim with parameters drawn from a real production AI cluster and checking whether its predicted utilization, downtime, and recovery times match the observed values in that cluster.
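
One way such a check could be scored is a per-metric relative error between simulated and observed values. The sketch below is only an illustration of that comparison: the helper names, the metric keys, and the 10% tolerance are hypothetical, and no production measurements are implied.

```python
def relative_errors(simulated, observed):
    """Per-metric |simulated - observed| / |observed| for metrics present in both."""
    return {
        name: abs(simulated[name] - observed[name]) / abs(observed[name])
        for name in observed
        if name in simulated and observed[name] != 0
    }


def within_tolerance(simulated, observed, tolerance=0.10):
    """True if at least one metric is shared and all are within `tolerance`."""
    errors = relative_errors(simulated, observed)
    return bool(errors) and all(err <= tolerance for err in errors.values())
```

A caller would pass two dictionaries keyed by metric name (for example utilization, downtime, and recovery time) and inspect which metrics exceed the chosen tolerance.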

Figures

Figures reproduced from arXiv: 2603.07041 by Fred Lin, Karthik Pattabiraman, Mihir Patel.

**Figure 2.** Graphs of the total training time in hours vs. (a) [image: figures/full_fig_p006_2.png]

Original abstract

Failures in clusters running large-scale AI workloads can result in decreased utilization. Because the cost of a failure in such AI workloads is high (as it requires restarting the entire job from a previous checkpoint), there are many mechanisms in place to ensure that the failures are mitigated, and the impact of a failure is minimized. However, these mechanisms have many knobs and parameters, all of which must be carefully tuned based on the system and cluster's characteristics. We built AIReSim, a discrete event simulator to evaluate the different design choices during the failure, recovery, scheduling and repair processes for a cluster running a large-scale AI workload. AIReSim allows the system designer to systematically evaluate the effects of the different knobs and parameters on the overall end-to-end reliability of the system. Further, AIReSim can be used to identify which knobs or parameters are important in order to prioritize the investment of effort in improving the system. AIReSim also allows tuning of the knobs for achieving different tradeoffs in the system, as well as to consider various "what-if" scenarios. We present a case study of applying AIReSim for capacity planning for large-scale clusters running AI workloads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

AIReSim names a discrete-event simulator for AI cluster reliability but supplies no validation, parameters, or results to show it matches reality.

The paper introduces AIReSim, a discrete-event simulator meant to let designers test failure, recovery, scheduling, and repair choices in large AI training clusters where a single failure can force a full job restart from checkpoint. The core idea is practical: these clusters are costly, so systematic what-if analysis on the knobs could help with utilization and capacity planning. The case study framing shows the authors have real scenarios in mind for prioritizing improvements and exploring tradeoffs. That part is straightforward and addresses a genuine pain point for operators running big AI jobs. Standard discrete-event techniques are applied here with a focus on restart costs, which is a reasonable incremental step rather than a new modeling framework.

The soft spot is the complete lack of grounding. The text gives no failure distributions, repair models, checkpoint overhead details, or any comparison of simulated outputs to production traces or metrics like lost utilization or MTTR. Without calibration data or validation experiments, the outputs remain unanchored and cannot reliably guide design decisions. This leaves the central claim, that the simulator enables useful evaluation, resting on an untested assumption about model fidelity.

The work is aimed at systems researchers and cloud engineers who tune reliability for AI workloads. A reader already building similar tools might pick up the problem framing, but anyone expecting concrete findings or reusable code will come away empty. I would send it for peer review only if the authors add validation sections and quantitative results; on the current description alone it is too preliminary to assess properly.

Referee Report

2 major / 1 minor

Summary. The paper introduces AIReSim, a discrete-event simulator for evaluating failure, recovery, scheduling, and repair mechanisms in large-scale AI clusters. It claims the tool enables systematic knob tuning, identification of important parameters, tradeoff analysis, and what-if scenarios, with a case study demonstrating its application to capacity planning for AI workloads.

Significance. If the underlying models are shown to be faithful to production AI cluster dynamics, AIReSim could help designers prioritize reliability investments and optimize utilization under high failure costs. The work addresses a timely problem in AI infrastructure, but its value hinges on whether simulated outputs predict real behavior.

major comments (2)

[Abstract] Abstract and model description: The central claim that AIReSim produces actionable evaluations of design choices requires the discrete-event timing models (failure arrivals, checkpoint overheads, repair latencies, scheduling interactions) to be predictive. No specification of event types, stochastic distributions, state machines, or input parameters is supplied, so the fidelity of any case-study outputs cannot be assessed.
[Case Study] Case study section: Results on knob importance and capacity-planning what-if scenarios are presented without any calibration to real traces, comparison of simulated vs. observed metrics (e.g., utilization loss or MTTR), or sensitivity analysis to modeling assumptions. This leaves the quantitative findings unanchored and undermines their use for guiding production decisions.

minor comments (1)

[Abstract] The abstract would benefit from one sentence summarizing the simulator architecture or key abstractions used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below and describe the revisions we will incorporate.

Referee: [Abstract] Abstract and model description: The central claim that AIReSim produces actionable evaluations of design choices requires the discrete-event timing models (failure arrivals, checkpoint overheads, repair latencies, scheduling interactions) to be predictive. No specification of event types, stochastic distributions, state machines, or input parameters is supplied, so the fidelity of any case-study outputs cannot be assessed.

Authors: We agree that the manuscript must supply explicit model details for the central claim to be evaluable. We will add a dedicated modeling section that defines all event types (failure, checkpoint, repair, scheduling), the stochastic distributions (e.g., exponential or Weibull for failures drawn from published AI-cluster studies), node/job state machines, and the complete set of input parameters with defaults and ranges. This addition will allow readers to assess the predictive fidelity of the case-study results. revision: yes
Referee: [Case Study] Case study section: Results on knob importance and capacity-planning what-if scenarios are presented without any calibration to real traces, comparison of simulated vs. observed metrics (e.g., utilization loss or MTTR), or sensitivity analysis to modeling assumptions. This leaves the quantitative findings unanchored and undermines their use for guiding production decisions.

Authors: The case study is presented as an illustrative demonstration of AIReSim's use for capacity planning and knob tuning, not as a validated prediction for a specific production deployment. We will revise the section to add sensitivity analysis over key parameters (failure rate, checkpoint interval, repair latency) and an explicit discussion of modeling assumptions and limitations. Direct calibration against proprietary production traces is not possible within the scope of this work; we will cite the relevant literature used for parameter selection and identify calibration as future work. revision: partial
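
The promised distributions and sensitivity analysis could look roughly like the sketch below. It is an illustration, not the authors' model: a Weibull time to failure (which reduces to the exponential when the shape is 1), a job that restarts from its last completed checkpoint interval, and a one-at-a-time sweep that scales a single parameter while holding the others at a baseline. All names and baseline values are hypothetical placeholders, and the model assumes the failure clock renews after each repair.

```python
import random
import statistics


def training_time(work, scale, shape, ckpt, repair, rng):
    """Wall-clock hours to finish `work` hours with checkpoint/restart."""
    done, clock = 0.0, 0.0
    while True:
        # weibullvariate(alpha, beta): alpha is the scale, beta the shape;
        # shape == 1.0 gives an exponential with mean `scale`.
        ttf = rng.weibullvariate(scale, shape)
        remaining = work - done
        if ttf >= remaining:
            return clock + remaining
        # Failure: keep only whole checkpoint intervals completed in this run.
        done += (ttf // ckpt) * ckpt
        clock += ttf + repair


def mean_time(params, runs=200, seed=0):
    """Average training time over `runs` trajectories from one seeded stream."""
    rng = random.Random(seed)
    return statistics.fmean(training_time(rng=rng, **params) for _ in range(runs))


def sweep(baseline, name, factors=(0.5, 1.0, 2.0)):
    """Scale one parameter by each factor, holding the others at baseline."""
    return {f: mean_time({**baseline, name: baseline[name] * f}) for f in factors}


# Hypothetical baseline, in hours; not taken from the paper.
BASELINE = dict(work=100.0, scale=50.0, shape=1.0, ckpt=5.0, repair=1.0)
```

For example, `sweep(BASELINE, "repair")` and `sweep(BASELINE, "ckpt")` can be compared to see which parameter moves mean training time more under these assumptions. Reusing the same seed across factors keeps the comparison from being dominated by sampling noise.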

Circularity Check

0 steps flagged

No circularity: simulator tool description with no derivation chain

The paper describes the construction and use of AIReSim, a discrete-event simulator for AI cluster reliability modeling, including failure, recovery, scheduling, and repair processes. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The central claims rest on the simulator's ability to evaluate design knobs and support what-if scenarios in a case study, without any self-referential reduction where outputs are defined by or fitted to the same inputs. Self-citations are absent from the load-bearing elements, and the work is self-contained as a tool-building contribution. Absence of empirical validation data is a model-fidelity concern, not a circularity issue per the evaluation criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that discrete-event modeling suffices to capture the relevant dynamics of AI workload failures and recoveries.

axioms (1)

domain assumption: Failure, recovery, and repair processes in AI clusters can be represented as discrete events with tunable parameters.
Invoked throughout the abstract as the basis for evaluating design choices.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

We built AIReSim, a discrete event simulator to evaluate the different design choices during the failure, recovery, scheduling and repair processes... AIReSim takes the following parameters as inputs... Failure rate... Repair times... Repair failure probability...
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

We use AIReSim to determine the values of different parameters and knobs... perform a parameter sweep over the set of knobs to understand the sensitivity

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 2 internal anchors

[1] OpenAI, “ChatGPT,” https://chat.openai.com/, 2025, accessed: 2025-01-23.
[2] A. Mishra, J. Cha, H. Park, and S. Kim, Artificial intelligence and hardware accelerators. Springer, 2023.
[3] A. Kokolis, M. Kuchnik, J. Hoffman, A. Kumar, P. Malani, F. Ma, Z. DeVito, S. Sengupta, K. Saladi, and C.-J. Wu, “Revisiting reliability in large-scale machine learning research clusters,” in 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2025, pp. 1259–1274.
[4] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
[5] C. Lv, X. Shi, D. Liang, W. Tan, and X. Zhao, “Specinf: Exploiting idle gpu resources in distributed dl training via speculative inference filling,” in Network and Parallel Computing: 20th IFIP WG 10.3 International Conference, NPC 2024, Haikou, China, December 7–8, 2024, Proceedings, Part I. Berlin, Heidelberg: Springer-Verlag, 2025, p. 146–158. [Online]....
[6] H. D. Dixit, S. Pendharkar, M. Beadon, C. Mason, T. Chakravarthy, B. Muthiah, and S. Sankar, “Silent data corruptions at scale,” arXiv preprint arXiv:2102.11245, 2021.
[7] S. Mitra, S. S. Banerjee, M. Dixon, M. Fuller, R. Govindaraju, P. Hochschild, E. X. Liu, B. Parthasarathy, and P. Ranganathan, “Silent data corruption by 10× test escapes threatens reliable computing,” IEEE Design and Test, vol. 42, no. 6, pp. 40–53, 2025.
[8] K. S. Trivedi, Probability and statistics with reliability, queuing, and computer science applications. John Wiley & Sons, 2001.
[9] G. S. Fishman, Discrete-event simulation: modeling, programming, and analysis. Springer, 2001, vol. 537.
[10] H. D. Dixit, L. Boyle, G. Vunnam, S. Pendharkar, M. Beadon, and S. Sankar, “Detecting silent data corruptions in the wild,” 2022. [Online]. Available: https://arxiv.org/abs/2203.08989
[11] E. N. Elnozahy and J. S. Plank, “Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery,” IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 2, pp. 97–108, 2004.
[12] X. Jiao, A. Pandey, K. Pattabiraman, and F. Lin, “Large-scale ai infra reliability: Challenges, strategies, and llama 3 training experience,” in 2025 55th Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume (DSN-S), 2025, pp. 140–146.
[13] Z. Jiang, J. Huang, G. Yu, Z. Chen, Y. Li, R. Zhong, C. Feng, Y. Yang, Z. Yang, and M. Lyu, L4: Diagnosing Large-scale LLM Training Failures via Automated Log Analysis. New York, NY, USA: Association for Computing Machinery, 2025, p. 51–63. [Online]. Available: https://doi.org/10.1145/3696630.3728531
[14] F. Lin, M. Beadon, H. D. Dixit, G. Vunnam, A. Desai, and S. Sankar, “Hardware remediation at scale,” in 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W), 2018, pp. 14–17.
[15] K. S. Trivedi and R. Sahner, “Sharpe at the age of twenty two,” SIGMETRICS Perform. Eval. Rev., vol. 36, no. 4, p. 52–57, Mar. 2009. [Online]. Available: https://doi.org/10.1145/1530873.1530884
[16] F. Quin, D. Weyns, M. Galster, and C. C. Silva, “A/b testing: A systematic literature review,” Journal of Systems and Software, vol. 211, p. 112011, 2024.
[17] H. Casanova, A. Giersch, A. Legrand, M. Quinson, and F. Suter, “Simgrid: a sustained effort for the versatile simulation of large scale distributed systems,” 2013. [Online]. Available: https://arxiv.org/abs/1309.1630
[18] S. Scherfke, O. Lünsdorf, P. Grayson, E. LaFevers, T. Pinckney, C. Klein, S. Vaidya, L. Reis, S. Reed, Z. Liu et al., “Simpy,” https://github.com/simpx/simpy, 2021.
