Recognition: unknown
Learning from Change: Predictive Models for Incident Prevention in a Regulated IT Environment
Pith reviewed 2026-05-10 13:49 UTC · model grok-4.3
The pith
LightGBM with aggregated team metrics outperforms rule-based systems for predicting IT change incidents while providing SHAP-based explanations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On a one-year real-world dataset of IT changes from a large international bank, LightGBM delivered the strongest performance in predicting which changes would cause incidents, particularly after the feature set was expanded with aggregated metrics describing the responsible teams and their organizational context. This data-driven approach surpassed the bank's current rule-based risk scoring, and SHAP explanations were incorporated to supply feature-level insights that support the auditability and explainability demanded by regulators.
What carries the argument
LightGBM classifier trained on change attributes plus aggregated team metrics, with SHAP values supplying per-prediction feature contributions for traceability.
If this is right
- Engineers can receive risk scores early in the assessment and planning stages to adjust or defer high-risk changes before deployment.
- IT operations can shift from post-incident response toward proactive mitigation of change-related disruptions.
- Regulated organizations gain a compliant way to document and justify risk decisions through feature-level explanations.
- Overall reliability of software services improves when data patterns replace static rules for change evaluation.
Where Pith is reading between the lines
- The same modeling approach could be tested in other regulated domains where changes affect critical infrastructure.
- Organizational context captured by team metrics suggests that technical features alone miss important risk drivers.
- Embedding the model directly into change-management tooling would allow real-time risk scoring during planning.
- Periodic retraining on newer data would be needed to keep predictions aligned with evolving team structures and processes.
Load-bearing premise
The one-year dataset from this single bank captures stable patterns that will continue to hold for future changes, and SHAP values meet all regulatory standards for auditability and explainability.
What would settle it
Running the trained LightGBM model on a fresh set of changes from the same bank and observing that its incident predictions are no more accurate than the existing rule-based system, or that an audit rejects the SHAP explanations as insufficiently transparent.
Figures
read the original abstract
Effective IT change management is important for businesses that depend on software and services, particularly in highly regulated sectors such as finance, where operational reliability, auditability, and explainability are essential. A significant portion of IT incidents are caused by changes, making it important to identify high-risk changes before deployment. This study presents a predictive incident risk scoring approach at a large international bank. The approach supports engineers during the assessment and planning phases of change deployments by predicting the potential of inducing incidents. To satisfy regulatory constraints, we built the model with auditability and explainability in mind, applying SHAP values to provide feature-level insights and ensure decisions are traceable and transparent. Using a one-year real-world dataset, we compare the existing rule-based process with three machine learning models: HGBC, LightGBM, and XGBoost. LightGBM achieved the best performance, particularly when enriched with aggregated team metrics that capture organisational context. Our results show that data-driven, interpretable models can outperform rule-based approaches while meeting compliance needs, enabling proactive risk mitigation and more reliable IT operations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a predictive model for incident risk from IT changes at a large international bank, comparing an existing rule-based process against three ML models (HGBC, LightGBM, XGBoost) on a one-year real-world dataset. LightGBM is reported to achieve the best performance, especially after enrichment with aggregated team metrics capturing organizational context, and SHAP values are used to provide feature-level explanations for regulatory auditability and explainability.
Significance. If the empirical results prove robust, the work offers a concrete demonstration that incorporating team-level organizational features can improve risk prediction over rule-based baselines in regulated IT environments while satisfying compliance needs through interpretable models. This could support more proactive change management practices in finance and similar sectors.
major comments (2)
- [Results] Results section: The abstract reports that LightGBM outperforms the rule-based process and other models, but provides no details on data splits (e.g., temporal train/test partitioning), feature engineering steps for the team aggregates, hyperparameter tuning, or statistical significance testing. These omissions are load-bearing for the central claim of superiority, as post-hoc choices could inflate the reported gains.
- [Methods] Methods section: The claim that SHAP values satisfy regulatory requirements for auditability is asserted without concrete mapping to specific compliance criteria or discussion of how attributions remain stable under the addition of team-level aggregates.
minor comments (1)
- [Abstract] The acronym HGBC is used without expansion; define it on first use (likely Histogram-based Gradient Boosting Classifier).
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments highlight important areas where additional methodological transparency will strengthen the paper's claims regarding model superiority and regulatory compliance. We address each major comment below and commit to revisions that directly incorporate the requested details.
read point-by-point responses
-
Referee: [Results] Results section: The abstract reports that LightGBM outperforms the rule-based process and other models, but provides no details on data splits (e.g., temporal train/test partitioning), feature engineering steps for the team aggregates, hyperparameter tuning, or statistical significance testing. These omissions are load-bearing for the central claim of superiority, as post-hoc choices could inflate the reported gains.
Authors: We agree that these details are essential to substantiate the performance claims and support reproducibility. In the revised manuscript, we will expand the Methods and Results sections with: (1) a clear description of the temporal train/test partitioning (first 9 months for training, final 3 months for testing, chosen to reflect forward-looking deployment and minimize leakage); (2) explicit feature engineering steps for team aggregates, including the aggregation functions (e.g., mean, count, variance over 30/90-day windows), normalization, and handling of missing values; (3) the hyperparameter tuning procedure (Bayesian optimization via Optuna with 5-fold time-series cross-validation); and (4) statistical significance testing (DeLong test for AUC differences and McNemar's test for classification outcomes, with p-values reported). These additions will allow readers to evaluate whether the reported gains are robust. revision: yes
-
Referee: [Methods] Methods section: The claim that SHAP values satisfy regulatory requirements for auditability is asserted without concrete mapping to specific compliance criteria or discussion of how attributions remain stable under the addition of team-level aggregates.
Authors: We acknowledge that the current text asserts regulatory suitability at a high level without sufficient grounding. We will revise the Methods section to provide a concrete mapping: SHAP attributions enable per-change traceability (supporting audit log requirements under frameworks such as Basel III operational risk guidelines and internal model governance), while feature-level explanations address explainability mandates for high-impact decisions. We will also add a dedicated subsection analyzing attribution stability, including quantitative comparison of SHAP values and global feature rankings before versus after adding team aggregates, plus qualitative discussion of any shifts in top contributors. This will demonstrate that the organizational features do not destabilize the explanations. revision: yes
Circularity Check
No significant circularity identified
full rationale
The paper is an empirical ML study that trains and evaluates LightGBM, XGBoost, and HGBC models on a held-out portion of a one-year bank dataset, comparing them to an existing rule-based baseline while using SHAP for post-hoc explainability. No load-bearing step reduces by construction to a fitted parameter reused as a prediction, a self-defined quantity, or a self-citation chain that substitutes for independent evidence. The central performance claim rests on direct comparison against external baselines and the same dataset's temporal split, satisfying the criteria for a self-contained empirical result without circular reduction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Historical IT change records paired with subsequent incidents form a representative training distribution for future changes.
- domain assumption SHAP values provide sufficient feature-level explanations to satisfy audit and regulatory requirements.
Reference graph
Works this paper leans on
-
[1]
2022-12-27. Regulation (EU) 2022/2554 of the European Parliament and of the Council of 14 December 2022 on digital operational resilience for the financial sector and amending Regulations (EC) No 1060/2009, (EU) No 648/2012, (EU) No 600/2014, (EU) No 909/2014 and (EU) 2016/1011.OJL 333 (2022-12-27), 1–79. https://eur-lex.europa.eu/eli/reg/2022/2554/oj
2022
-
[2]
Amina Adadi and Mohammed Berrada. 2018. Peeking inside the black-box: a survey on explainable artificial intelligence (XAI).IEEE access6 (2018), 52138– 52160
2018
-
[3]
Salman Ahmed, Muskaan Singh, Brendan Doherty, Effirul Ramlan, Kathryn Harkin, Magda Bucholc, and Damien Coyle. 2023. An empirical analysis of state-of-art classification models in an it incident severity prediction framework. Applied Sciences13, 6 (2023), 3843
2023
-
[4]
Gergő Barta. 2018. The increasing role of IT auditors in financial audit: risks and intelligent answers.Business, Management and Education16, 1 (2018), 81–93
2018
-
[5]
Raghav Batta, Larisa Shwartz, Michael Nidd, Amar Prakash Azad, and Harshit Kumar. 2021. A system for proactive risk assessment of application changes in cloud operations. In2021 IEEE 14th International Conference on Cloud Computing (CLOUD). 112–123. doi:10.1109/CLOUD53861.2021.00025
-
[6]
Kent Beck, Mike Beedle, Arie Van Bennekum, Alistair Cockburn, Ward Cunning- ham, Martin Fowler, James Grenning, Jim Highsmith, Andrew Hunt, Ron Jeffries, et al. 2001. The agile manifesto
2001
-
[7]
O’Reilly Media, Inc
Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. 2016.Site reliability engineering: How Google runs production systems. " O’Reilly Media, Inc. "
2016
-
[8]
Junjie Chen, Shu Zhang, Xiaoting He, Qingwei Lin, Hongyu Zhang, Dan Hao, Yu Kang, Feng Gao, Zhangwei Xu, Yingnong Dang, and Dongmei Zhang. 2021. How incidental are the incidents? characterizing and prioritizing incidents for large- scale online service systems. InProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering(V...
-
[9]
Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining(San Francisco, California, USA)(KDD ’16). Association for Computing Machinery, New York, NY, USA, 785–794. doi:10. 1145/2939672.2939785
-
[10]
European Commission. 2024. The act texts. https://artificialintelligenceact.eu/ the-act/
2024
-
[11]
EBA. 2021. Revised Guidelines on major incident reporting under PSD2. https://www.eba.europa.eu/regulation-and-policy/payment-services- and-electronic-money/guidelines-on-major-incidents-reporting-under-psd2
2021
-
[12]
IBM Cloud Education. 2019. IT Infrastructure Library (ITIL). https://www.ibm. com/cloud/learn/it-infrastructure-library
2019
-
[13]
Stephen Elliot. 2014. DevOps and the cost of downtime: Fortune 1000 best practice metrics quantified.International Data Corporation (IDC)(2014)
2014
-
[14]
2018.Accelerate
Nicole Forsgren, Jez Humble, and Gene Kim. 2018.Accelerate. IT Revolution, Portland, OR
2018
-
[15]
Supriyo Ghosh, Manish Shetty, Chetan Bansal, and Suman Nath. 2022. How to fight production incidents? an empirical study on a large-scale cloud service. In Proceedings of the 13th Symposium on Cloud Computing(San Francisco, California) (SoCC ’22). Association for Computing Machinery, New York, NY, USA, 126–141. doi:10.1145/3542929.3563482
-
[16]
Sinem Güven and Karin Murthy. 2016. Understanding the role of change in incident prevention. In2016 12th International Conference on Network and Service Management (CNSM). 268–271. doi:10.1109/CNSM.2016.7818430
-
[17]
Sinem Güven, Karin Murthy, Larisa Shwartz, and Amit Paradkar. 2016. Towards establishing causality between change and incident. InNOMS 2016 - 2016 IEEE/IFIP Network Operations and Management Symposium. 937–942. doi:10.1109/NOMS. 2016.7502929
-
[18]
Zhe Hui Hoo, Jane Candlish, and Dawn Teare. 2017. What is an ROC curve?Emergency Medicine Journal34, 6 (2017), 357–359. arXiv:https://emj.bmj.com/content/34/6/357.full.pdf doi:10.1136/emermed-2017- 206735
- [19]
-
[20]
Nathalie Japkowicz and Shaju Stephen. 2002. The class imbalance problem: A systematic study.Intelligent data analysis6, 5 (2002), 429–449
2002
-
[21]
Eileen Kapel, Luis Cruz, Diomidis Spinellis, and Arie van Deursen. 2024. En- hancing Incident Management: Insights from a Case Study at ING. InProceedings of the 1st IEEE/ACM Workshop on Software Engineering Challenges in Financial Firms(Lisbon, Portugal)(FinanSE ’24). Association for Computing Machinery, New York, NY, USA, 1–8. doi:10.1145/3643665.3648048
-
[22]
Eileen Kapel, Luis Cruz, Diomidis Spinellis, and Arie Van Deursen. 2024. On the Difficulty of Identifying Incident-Inducing Changes. InProceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice (Lisbon, Portugal)(ICSE-SEIP ’24). Association for Computing Machinery, New York, NY, USA, 36–46. doi:10.1145/36394...
-
[23]
Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. 2017. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. InAdvances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https...
2017
-
[24]
Ze Li, Qian Cheng, Ken Hsieh, Yingnong Dang, Peng Huang, Pankaj Singh, Xin- sheng Yang, Qingwei Lin, Youjiang Wu, Sebastien Levy, and Murali Chintalapati
-
[25]
In17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20)
Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure. In17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20). USENIX Association, Santa Clara, CA, 389–402. https://www.usenix.org/conference/nsdi20/presentation/li
-
[26]
Scott Lundberg. 2017. A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874(2017)
work page Pith review arXiv 2017
-
[27]
Scott Lundberg. 2021. Be careful when interpreting predictive models in search of causal insights. https://medium.com/data-science/be-careful-when- interpreting-predictive-models-in-search-of-causal-insights-e68626e664b6
2021
-
[28]
Scott M Lundberg, Gabriel G Erion, and Su-In Lee. 2018. Consistent individualized feature attribution for tree ensembles.arXiv preprint arXiv:1802.03888(2018)
work page Pith review arXiv 2018
-
[29]
Adya Mishra. 2019. Exploring ITIL and ITSM Change Management in Highly Regulated Industries: A Review of Best Practices and Challenges.International Journal of Innovative Research in Engineering & Multidisciplinary Physical Sciences 7 (10 2019), 1–8. doi:10.5281/zenodo.14838584
- [30]
-
[31]
Edoardo Mosca, Ferenc Szigeti, Stella Tragianni, Daniel Gallagher, and Georg Groh. 2022. SHAP-based explanation methods: a review for NLP interpretability. InProceedings of the 29th international conference on computational linguistics. 4593–4603
2022
-
[32]
Francis Sahngun Nahm. 2022. Receiver operating characteristic curve: overview and practical use for clinicians.Korean journal of anesthesiology75, 1 (2022), 25–36
2022
-
[33]
2016.Tree boosting with xgboost-why does xgboost win" every" machine learning competition?Master’s thesis
Didrik Nielsen. 2016.Tree boosting with xgboost-why does xgboost win" every" machine learning competition?Master’s thesis. NTNU
2016
-
[34]
European Parliament and Council of the European Union. 2018. https://www. eba.europa.eu/about-us
2018
-
[35]
2019-11-29
European Parliament and Council of the European Union. 2019-11-29. Directive (EU) EBA/GL/2019/04 of the European Parliament and of the Council of 29 November 2019 on EBA Guidelines on ICT and secu- rity risk management, repealing Directive EBA/GL/2017/17.OJ(2019-11- 29). https://www.eba.europa.eu/regulation-and-policy/internal-governance/ guidelines-on-ic...
2019
-
[36]
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python.Journal of machine learning research12, Oct (2011), 2825–2830
2011
- [37]
-
[38]
Tom-Martijn Roelofs, Eduardo Barbaro, Svetlana Pekarskikh, Katarzyna Orze- chowska, Marta Kwapień, Jakub Tyrlik, Dinu Smadu, Michel Van Eeten, and Yury Zhauniarovich. 2024. Finding harmony in the noise: Blending security alerts for attack detection. InProceedings of the 39th ACM/SIGAPP Symposium on Applied Computing. 1385–1394
2024
- [39]
-
[40]
ServiceNow. [n. d.]. What is ServiceNow? https://www.servicenow.com/what- is-servicenow.html
-
[41]
ServiceNow. 2024. Change Success Score Product Documentation Release Washington DC. https://docs.servicenow.com/bundle/washingtondc-it-service- management/page/product/change-management/concept/change-success- score.html
2024
-
[42]
United States Code. 2002. Sarbanes-Oxley Act of 2002, PL 107-204, 116 Stat 745. Codified in Sections 11, 15, 18, 28, and 29 USC
2002
-
[43]
Xiaohan Yan, Ken Hsieh, Yasitha Liyanage, Minghua Ma, Murali Chintalapati, Qingwei Lin, Yingnong Dang, and Dongmei Zhang. 2023. Aegis: Attribution of Control Plane Change Impact across Layers and Components for Cloud Systems. In2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). 222–233. doi:1...
-
[44]
2009.Case study research: Design and methods
Robert K Yin. 2009.Case study research: Design and methods. Vol. 5. sage
2009
-
[45]
Ennan Zhai, Ang Chen, Ruzica Piskac, Mahesh Balakrishnan, Bingchuan Tian, Bo Song, and Haoliang Zhang. 2020. Check before You Change: Preventing ICSE-SEIP ’26, April 12–18, 2026, Rio de Janeiro, Brazil Eileen Kapel, Jan Lennartz, Luis Cruz, Diomidis Spinellis, and Arie van Deursen Correlated Failures in Service Updates. In17th USENIX Symposium on Networke...
2020
-
[46]
Nengwen Zhao, Junjie Chen, Zhaoyang Yu, Honglin Wang, Jiesong Li, Bin Qiu, Hongyu Xu, Wenchi Zhang, Kaixin Sui, and Dan Pei. 2021. Identifying bad software changes via multimodal anomaly detection for online service systems. InProceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software...
-
[47]
Yujin Zhao, Ling Jiang, Ye Tao, Songlin Zhang, Changlong Wu, Tong Jia, Xiaosong Huang, Ying Li, and Zhonghai Wu. 2023. Identifying Root-Cause Changes for User-Reported Incidents in Online Service Systems. In2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE). 287–297. doi:10.1109/ ISSRE59848.2023.00028
-
[48]
Quan Zou, Sifa Xie, Ziyu Lin, Meihong Wu, and Ying Ju. 2016. Finding the Best Classification Threshold in Imbalanced Classification.Big Data Research5 (2016), 2–8. doi:10.1016/j.bdr.2015.12.001 Big data analytics and applications. Received 29 September 2025; revised 8 December 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.