Forecasting the Maintained Score from the OpenSSF Scorecard: A Study of GitHub Repositories Linked to PyPI Packages
Pith reviewed 2026-05-16 10:52 UTC · model grok-4.3
The pith
Future maintenance activity in open-source GitHub repositories can be forecasted from historical OpenSSF Maintained scores with meaningful accuracy using aggregated targets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that future maintenance activity, as captured by the OpenSSF Maintained score, can be forecasted with meaningful accuracy from historical data, particularly with aggregated representations: bucketed scores and trend types reach accuracies above 0.95 and 0.79, respectively. It also claims that a standard Random Forest model performs at least as well as an LSTM for this task.
What carries the argument
The central mechanism is the multivariate time series forecasting setup that converts historical Maintained scores into four target representations—raw numerical value, low-moderate-high buckets, numerical slope, and categorical trend direction—then applies Random Forest and LSTM models on 3-12 month training windows to predict 1-6 month horizons.
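The window-and-horizon setup can be illustrated with a minimal sketch. It assumes one observation per month, a single univariate series, and direct prediction of one horizon at a time; the paper itself uses a multivariate setup whose exact features are not given here. The function name `make_windows` and the sample series are hypothetical.

```python
from typing import List, Tuple


def make_windows(
    series: List[float], window: int, horizon: int
) -> List[Tuple[List[float], float]]:
    """Turn one monthly score series into (history, target) pairs.

    history: `window` consecutive monthly scores
    target:  the score `horizon` months after the end of that history

    Assumption: one score per month. The sampling granularity used in
    the paper is not stated in this summary.
    """
    if window < 1 or horizon < 1:
        raise ValueError("window and horizon must be positive")
    samples = []
    last_start = len(series) - window - horizon
    for start in range(last_start + 1):
        history = series[start:start + window]
        target = series[start + window + horizon - 1]
        samples.append((history, target))
    return samples


# Hypothetical Maintained scores for one repository, one value per month.
monthly_scores = [10, 10, 9, 8, 8, 6, 5, 5, 3, 2, 1, 0]
samples = make_windows(monthly_scores, window=3, horizon=2)
for history, target in samples[:3]:
    print(history, "->", target)
```

Each pair can then be fed to any supervised learner; longer windows and horizons leave fewer pairs per repository, which is one reason to report results per configuration.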
If this is right
- Developers and organizations could use forecasts to identify likely future maintenance drops and prioritize dependency reviews.
- Bucketed and categorical trend targets deliver substantially higher accuracy than raw numerical scores.
- Standard Random Forest models suffice and remove the need for complex deep learning architectures in this domain.
- Forecasts remain useful across horizons of one to six months when aggregated representations are used.
Where Pith is reading between the lines
- Dependency management tools could incorporate these forecasts to trigger automated alerts for packages at risk of abandonment.
- The same reconstruction-and-forecast pipeline might extend to other OpenSSF Scorecard metrics such as security practices or licensing.
- Similar time-series approaches could apply to repositories in additional ecosystems such as npm or Maven without major redesign.
- Accuracy would likely improve if models incorporated additional signals like commit frequency or contributor counts beyond the Maintained score alone.
Load-bearing premise
The historical Maintained scores reconstructed over three years from GitHub activity logs are accurate and representative enough to serve as reliable training targets.
What would settle it
An independent replication on a new collection of repositories that finds bucketed-score forecasting accuracy below 0.8 would falsify the claim of meaningful predictive performance.
Original abstract
Background: The OpenSSF Scorecard is widely used to assess the security posture of open-source software repositories, with the Maintained metric serving as a key indicator of recent maintenance activities, helping users identify actively maintained projects and potentially abandoned dependencies. However, the metric is inherently retrospective, providing only a short-term snapshot based on the past 90 days of repository activity and offering no insight into the future. This limitation complicates risk assessment for developers and organizations that rely on open-source dependencies.
Aims: In this paper, we investigate the feasibility of forecasting future maintenance activities as captured by the OpenSSF Maintained score.
Method: Focusing on 3,220 GitHub repositories linked to one of the top 1% most central PyPI libraries, as ranked by PageRank, we reconstruct historical Maintained scores over a three-year period and frame the problem as a multivariate time series forecasting task. We study four target representations: the raw Maintained score (0-10), a bucketed score capturing low (0-2), moderate (3-7), and high (8-10) maintenance levels, the numerical trend slope between consecutive scores, and categorical trend types (downward, stable, upward). We compare a machine learning model (Random Forest) and a deep learning model (LSTM) using training windows of 3-12 months and forecasting horizons of 1-6 months.
Results: Our results show that future maintenance activity can be forecasted with meaningful accuracy, particularly when using aggregated representations such as bucketed scores and trend types leading to accuracies above 0.95 and 0.79. Notably, simpler machine learning models perform at least on par with deep learning approaches, suggesting that effective forecasting does not require complex architectures.
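The four target representations named in the abstract can be sketched in a few lines. The bucket boundaries (0-2, 3-7, 8-10) come from the abstract. Treating the slope as the plain difference between consecutive scores, and the `tolerance` parameter that decides what counts as "stable", are assumptions made for illustration; the paper's exact definitions are not given here.

```python
def bucket(score: int) -> str:
    """Map a raw Maintained score (0-10) to the buckets from the abstract."""
    if not 0 <= score <= 10:
        raise ValueError("Maintained scores range from 0 to 10")
    if score <= 2:
        return "low"
    if score <= 7:
        return "moderate"
    return "high"


def slope(previous: float, current: float) -> float:
    """Numerical trend between consecutive scores.

    Assumption: slope is the difference per time step.
    """
    return current - previous


def trend_type(previous: float, current: float, tolerance: float = 0.0) -> str:
    """Categorical trend direction.

    Assumption: changes with magnitude <= `tolerance` count as stable;
    `tolerance` is a hypothetical parameter, not taken from the paper.
    """
    delta = slope(previous, current)
    if delta > tolerance:
        return "upward"
    if delta < -tolerance:
        return "downward"
    return "stable"


# Hypothetical consecutive scores for one repository.
scores = [9, 9, 7, 2, 4]
print([bucket(s) for s in scores])
print([trend_type(a, b) for a, b in zip(scores, scores[1:])])
```

The coarser targets discard information by design: many distinct raw scores collapse into one bucket, which is part of why they are easier to predict.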
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study forecasting future OpenSSF Maintained scores (0-10 scale) for 3,220 GitHub repositories linked to top-1% PyPI packages by PageRank. Historical scores are reconstructed over three years from GitHub activity data and the problem is framed as multivariate time-series forecasting. Four target representations are studied (raw score, bucketed low/moderate/high, numerical trend slope, categorical trend type) using Random Forest and LSTM models across training windows of 3-12 months and horizons of 1-6 months. The central claim is that meaningful accuracy is achievable, especially with aggregated representations (bucketed scores >0.95, trend types >0.79), and that simpler ML models perform at least as well as deep learning.
Significance. If the forecasting results hold under validated targets, the work has clear practical value for software supply-chain risk assessment by shifting the Maintained metric from purely retrospective to predictive. The large-scale real-world dataset, systematic comparison of target encodings and model families, and the finding that Random Forest matches or exceeds LSTM are concrete strengths. The study also supplies falsifiable accuracy numbers that can be tested on new repositories.
major comments (2)
- [§3] §3 (Data Collection and Reconstruction): The reconstruction of historical Maintained scores is load-bearing for every reported accuracy. The manuscript must detail exactly how the OpenSSF Scorecard's 90-day checks (commit frequency, issue resolution, etc.) were replicated from archived GitHub data, including any approximations for missing events or changes in the scorecard implementation itself. Without this, the training targets may contain systematic bias relative to what an actual scorecard run would have produced at each past date.
- [§5] §5 (Results and Evaluation): The accuracies >0.95 for bucketed scores and >0.79 for trend types are presented without a persistence baseline (predict last observed bucket or trend). Because maintenance activity can be stable over short horizons, a naive baseline may already achieve comparable performance; its absence prevents readers from judging whether the models extract genuine predictive signal.
minor comments (2)
- [Abstract] The abstract and results should explicitly state the primary metric (accuracy, macro-F1, etc.) and the exact configurations that produced the headline numbers rather than aggregating across all windows/horizons.
- [Figures/Tables] Figure captions and tables should include the number of repositories and time steps used for each reported accuracy to allow direct assessment of statistical power.
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation for major revision. The two major comments identify important gaps in methodological transparency and evaluation rigor. We address each below and will revise the manuscript accordingly.
- Referee: [§3] §3 (Data Collection and Reconstruction): The reconstruction of historical Maintained scores is load-bearing for every reported accuracy. The manuscript must detail exactly how the OpenSSF Scorecard's 90-day checks (commit frequency, issue resolution, etc.) were replicated from archived GitHub data, including any approximations for missing events or changes in the scorecard implementation itself. Without this, the training targets may contain systematic bias relative to what an actual scorecard run would have produced at each past date.
  Authors: We agree that the current description in §3 is insufficiently detailed. In the revised manuscript we will expand §3 with a precise account of the reconstruction pipeline: the exact GitHub event types and time windows used to emulate each Maintained sub-check, the rules applied when events are missing from the archive, and any adjustments made for known changes in the official scorecard implementation during the three-year period. We will also add a short discussion of possible systematic biases and the steps taken to limit their impact on the reported accuracies.
  Revision: yes
- Referee: [§5] §5 (Results and Evaluation): The accuracies >0.95 for bucketed scores and >0.79 for trend types are presented without a persistence baseline (predict last observed bucket or trend). Because maintenance activity can be stable over short horizons, a naive baseline may already achieve comparable performance; its absence prevents readers from judging whether the models extract genuine predictive signal.
  Authors: We concur that the absence of a persistence baseline limits interpretability. The revised §5 will include a persistence baseline that simply carries forward the most recent observed bucket (or trend category) for the forecast horizon. We will report all model accuracies alongside this baseline for every training-window / horizon combination, allowing readers to see the incremental value of the learned models.
  Revision: yes
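A persistence baseline of the kind discussed in this exchange can be sketched as follows. The forecast for a future step is simply the label observed `horizon` steps earlier. The function name `persistence_accuracy` and the label sequence are hypothetical, and the paper's evaluation protocol (splits, per-repository aggregation) is not reproduced here.

```python
from typing import Sequence


def persistence_accuracy(labels: Sequence[str], horizon: int) -> float:
    """Accuracy of carrying the last observed label forward `horizon` steps.

    For every position i with a known label at i + horizon, the forecast
    is labels[i] and the ground truth is labels[i + horizon].
    """
    if horizon < 1:
        raise ValueError("horizon must be positive")
    pairs = [
        (labels[i], labels[i + horizon])
        for i in range(len(labels) - horizon)
    ]
    if not pairs:
        raise ValueError("series is too short for this horizon")
    hits = sum(1 for predicted, actual in pairs if predicted == actual)
    return hits / len(pairs)


# Hypothetical monthly bucket labels for one repository.
labels = ["high", "high", "high", "moderate", "moderate", "low"]
for h in (1, 2):
    print(f"horizon={h}: accuracy={persistence_accuracy(labels, h):.2f}")
```

A learned model adds value only to the extent that it beats this number on the same repositories, windows, and horizons; on very stable series the baseline alone can score highly.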
Circularity Check
No significant circularity; empirical forecasting on reconstructed historical data
The paper reconstructs historical Maintained scores from GitHub activity over three years and frames forecasting as a standard multivariate time-series task. It trains Random Forest and LSTM models on windows of 3-12 months to predict 1-6 month horizons, reporting accuracies on held-out data for raw scores, bucketed categories, slopes, and trend types. No derivation reduces to its inputs by construction: the accuracies are empirical model performance metrics, not tautological. There are no self-definitional equations, no fitted parameters renamed as predictions, and no load-bearing self-citations. The chain is data-driven and externally falsifiable against future observed scores.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Historical Maintained scores can be reliably reconstructed from GitHub commit and activity data over a three-year window.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "We reconstruct historical Maintained scores over a three-year period and frame the problem as a multivariate time series forecasting task... accuracies above 0.95 and 0.79."
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "The Maintained score formula... sliding 90-day activity aggregation... round(g(t) · min(10, M̂(t)))"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.