pith. machine review for the scientific record.

arxiv: 2601.18344 · v2 · submitted 2026-01-26 · 💻 cs.SE

Forecasting the Maintained Score from the OpenSSF Scorecard: A Study of GitHub Repositories Linked to PyPI Packages

Pith reviewed 2026-05-16 10:52 UTC · model grok-4.3

classification 💻 cs.SE
keywords OpenSSF Scorecard · Maintained score · maintenance forecasting · time series forecasting · GitHub repositories · PyPI packages · Random Forest · LSTM

The pith

Future maintenance activity in open-source GitHub repositories can be forecasted from historical OpenSSF Maintained scores with meaningful accuracy using aggregated targets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether past OpenSSF Maintained scores can predict future maintenance levels for GitHub repositories linked to top PyPI packages. It reconstructs three years of historical scores and frames the task as multivariate time series forecasting with four target formats: raw scores, three-level buckets, numerical trend slopes, and categorical trend types. Random Forest and LSTM models are compared across training windows of 3-12 months and forecast horizons of 1-6 months. Reported accuracy exceeds 0.95 for bucketed scores and 0.79 for trend types, and the simpler Random Forest performs at least on par with the LSTM. This matters because the existing score is only a retrospective snapshot of the past 90 days and offers no forward view for dependency risk assessment.

Core claim

The paper claims that future maintenance activity, as captured by the OpenSSF Maintained score, can be forecasted with meaningful accuracy from historical data, particularly with aggregated representations: bucketed scores reach accuracies above 0.95 and trend types above 0.79. It further claims that a standard Random Forest model performs at least as well as an LSTM for this task.

What carries the argument

The central mechanism is the multivariate time series forecasting setup that converts historical Maintained scores into four target representations—raw numerical value, low-moderate-high buckets, numerical slope, and categorical trend direction—then applies Random Forest and LSTM models on 3-12 month training windows to predict 1-6 month horizons.
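As an illustration, the four target representations can be derived from a monthly score series roughly as follows. This is a minimal Python sketch, not the authors' implementation: the bucket boundaries (0-2, 3-7, 8-10) come from the abstract, while the helper names, the sample series, and the `tolerance` that decides when a slope counts as stable are assumptions made for this example.

```python
from typing import List


def bucket(score: int) -> str:
    """Map a raw Maintained score (0-10) to the low/moderate/high levels."""
    if not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    if score <= 2:
        return "low"
    if score <= 7:
        return "moderate"
    return "high"


def slopes(scores: List[int]) -> List[int]:
    """Numerical trend: difference between consecutive scores."""
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


def trend_type(slope: float, tolerance: float = 0.0) -> str:
    """Categorical trend; `tolerance` is an assumed threshold, not from the paper."""
    if slope > tolerance:
        return "upward"
    if slope < -tolerance:
        return "downward"
    return "stable"


# Hypothetical monthly Maintained scores for one repository.
monthly_scores = [10, 10, 7, 3, 2, 2, 5]

buckets = [bucket(s) for s in monthly_scores]
trend_slopes = slopes(monthly_scores)
trend_types = [trend_type(s) for s in trend_slopes]

print(buckets)
print(trend_slopes)
print(trend_types)
```

The bucketed and categorical views discard magnitude on purpose: a drop from 10 to 7 and a drop from 7 to 3 are both just "downward", which is part of why these targets are easier to predict than the raw score.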

If this is right

  • Developers and organizations could use forecasts to identify likely future maintenance drops and prioritize dependency reviews.
  • Bucketed and categorical trend targets deliver substantially higher accuracy than raw numerical scores.
  • Standard Random Forest models suffice and remove the need for complex deep learning architectures in this domain.
  • Forecasts remain useful across horizons of one to six months when aggregated representations are used.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Dependency management tools could incorporate these forecasts to trigger automated alerts for packages at risk of abandonment.
  • The same reconstruction-and-forecast pipeline might extend to other OpenSSF Scorecard metrics such as security practices or licensing.
  • Similar time-series approaches could apply to repositories in additional ecosystems such as npm or Maven without major redesign.
  • Accuracy would likely improve if models incorporated additional signals like commit frequency or contributor counts beyond the Maintained score alone.

Load-bearing premise

The historical Maintained scores reconstructed over three years from GitHub activity logs are accurate and representative enough to serve as reliable training targets.

What would settle it

An independent replication on a new collection of repositories that finds bucketed-score forecasting accuracy below 0.8 would falsify the claim of meaningful predictive performance.

Figures

Figures reproduced from arXiv: 2601.18344 by Alexander Pretschner, Alexandros Tsakpinis, Efe Berk Ergüleç, Emil Schwenger.

Figure 2: Existing forecasting types. Depending on the problem definition, some studies suggest that multi-step-ahead iterative forecasting is more efficient [17], while others advocate for multi-step-ahead direct forecasting, particularly for longer prediction horizons [46]. In this study, we prioritize the latter approach and limit the forecasting horizon to six months, as forecast accuracy generally declines with …
Figure 3: Forecasting the Maintained score of the OpenSSF Scorecard (dates shown are exemplary). The following sections provide a detailed explanation of the data collection and preprocessing phase, the time series analysis conducted to understand the characteristics of the dataset, and the specific steps involved in the model development and evaluation. …
Figure 4: Distribution of OpenSSF Maintained scores. Representation of the Target Variable as Bucketed Score. While the raw Maintained score provides a fine-grained view of repository activity, its distribution remains imbalanced, with the majority of observations concentrated at the extreme values of 0 and 10. This imbalance limits the ability of forecasting models to effectively learn patterns across the full score…
Figure 5: Distribution of bucketed scores
Figure 7: Distribution of trend type
Figure 8: Aggregated accuracy across model families and …
Figure 9: Confusion matrices to assess misclassifications

Original abstract

Background: The OpenSSF Scorecard is widely used to assess the security posture of open-source software repositories, with the Maintained metric serving as a key indicator of recent maintenance activities, helping users identify actively maintained projects and potentially abandoned dependencies. However, the metric is inherently retrospective, providing only a short-term snapshot based on the past 90 days of repository activity and offering no insight into the future. This limitation complicates risk assessment for developers and organizations that rely on open-source dependencies. Aims: In this paper, we investigate the feasibility of forecasting future maintenance activities as captured by the OpenSSF Maintained score. Method: Focusing on 3,220 GitHub repositories linked to one of the top 1% most central PyPI libraries, as ranked by PageRank, we reconstruct historical Maintained scores over a three-year period and frame the problem as a multivariate time series forecasting task. We study four target representations: the raw Maintained score (0-10), a bucketed score capturing low (0-2), moderate (3-7), and high (8-10) maintenance levels, the numerical trend slope between consecutive scores, and categorical trend types (downward, stable, upward). We compare a machine learning model (Random Forest) and a deep learning model (LSTM) using training windows of 3-12 months and forecasting horizons of 1-6 months. Results: Our results show that future maintenance activity can be forecasted with meaningful accuracy, particularly when using aggregated representations such as bucketed scores and trend types leading to accuracies above 0.95 and 0.79. Notably, simpler machine learning models perform at least on par with deep learning approaches, suggesting that effective forecasting does not require complex architectures.
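The windowing described in the Method can be sketched as follows. This is a simplified, univariate Python illustration of direct multi-step forecasting, in which each sample pairs a training window with the single value a fixed number of months ahead. The paper's setup is multivariate, and `make_windows`, the sample series, and the chosen window and horizon are hypothetical.

```python
from typing import List, Sequence, Tuple


def make_windows(
    series: Sequence[int], window: int, horizon: int
) -> List[Tuple[List[int], int]]:
    """Build (history, target) pairs for direct forecasting.

    `history` holds `window` consecutive observations; `target` is the
    observation `horizon` steps after the last value in the history.
    """
    if window < 1 or horizon < 1:
        raise ValueError("window and horizon must be positive")
    pairs = []
    last_start = len(series) - window - horizon
    for start in range(last_start + 1):
        history = list(series[start:start + window])
        target = series[start + window + horizon - 1]
        pairs.append((history, target))
    return pairs


# Hypothetical monthly Maintained scores for one repository.
scores = [10, 10, 9, 7, 5, 3, 2, 0]

# 3-month training window, forecasting 2 months ahead.
pairs = make_windows(scores, window=3, horizon=2)
for history, target in pairs:
    print(history, "->", target)
```

In a direct strategy, a separate set of pairs (and typically a separate model) is built for each horizon from 1 to 6 months, rather than feeding one-step predictions back into the model.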

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents an empirical study forecasting future OpenSSF Maintained scores (0-10 scale) for 3,220 GitHub repositories linked to top-1% PyPI packages by PageRank. Historical scores are reconstructed over three years from GitHub activity data and the problem is framed as multivariate time-series forecasting. Four target representations are studied (raw score, bucketed low/moderate/high, numerical trend slope, categorical trend type) using Random Forest and LSTM models across training windows of 3-12 months and horizons of 1-6 months. The central claim is that meaningful accuracy is achievable, especially with aggregated representations (bucketed scores >0.95, trend types >0.79), and that simpler ML models perform at least as well as deep learning.

Significance. If the forecasting results hold under validated targets, the work has clear practical value for software supply-chain risk assessment by shifting the Maintained metric from purely retrospective to predictive. The large-scale real-world dataset, systematic comparison of target encodings and model families, and the finding that Random Forest matches or exceeds LSTM are concrete strengths. The study also supplies falsifiable accuracy numbers that can be tested on new repositories.

major comments (2)
  1. [§3] §3 (Data Collection and Reconstruction): The reconstruction of historical Maintained scores is load-bearing for every reported accuracy. The manuscript must detail exactly how the OpenSSF Scorecard's 90-day checks (commit frequency, issue resolution, etc.) were replicated from archived GitHub data, including any approximations for missing events or changes in the scorecard implementation itself. Without this, the training targets may contain systematic bias relative to what an actual scorecard run would have produced at each past date.
  2. [§5] §5 (Results and Evaluation): The accuracies >0.95 for bucketed scores and >0.79 for trend types are presented without a persistence baseline (predict last observed bucket or trend). Because maintenance activity can be stable over short horizons, a naive baseline may already achieve comparable performance; its absence prevents readers from judging whether the models extract genuine predictive signal.
minor comments (2)
  1. [Abstract] The abstract and results should explicitly state the primary metric (accuracy, macro-F1, etc.) and the exact configurations that produced the headline numbers rather than aggregating across all windows/horizons.
  2. [Figures/Tables] Figure captions and tables should include the number of repositories and time steps used for each reported accuracy to allow direct assessment of statistical power.
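To illustrate why the choice of primary metric matters when one class dominates, the sketch below compares accuracy with macro-F1 on a small imbalanced label set. The labels and predictions are invented for the example and are not results from the paper; the helper functions are plain-Python stand-ins written for this sketch.

```python
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Invented example: 18 "high" repositories, 2 "low" ones,
# and a model that always predicts "high".
y_true = ["high"] * 18 + ["low"] * 2
y_pred = ["high"] * 20

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={mf1:.3f}")
```

Here a model that never identifies a low-maintenance repository still reaches high accuracy, while macro-F1 drops sharply, which is why naming the metric behind each headline number helps readers interpret it.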

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for major revision. The two major comments identify important gaps in methodological transparency and evaluation rigor. We address each below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§3] §3 (Data Collection and Reconstruction): The reconstruction of historical Maintained scores is load-bearing for every reported accuracy. The manuscript must detail exactly how the OpenSSF Scorecard's 90-day checks (commit frequency, issue resolution, etc.) were replicated from archived GitHub data, including any approximations for missing events or changes in the scorecard implementation itself. Without this, the training targets may contain systematic bias relative to what an actual scorecard run would have produced at each past date.

    Authors: We agree that the current description in §3 is insufficiently detailed. In the revised manuscript we will expand §3 with a precise account of the reconstruction pipeline: the exact GitHub event types and time windows used to emulate each Maintained sub-check, the rules applied when events are missing from the archive, and any adjustments made for known changes in the official scorecard implementation during the three-year period. We will also add a short discussion of possible systematic biases and the steps taken to limit their impact on the reported accuracies.
    revision: yes

  2. Referee: [§5] §5 (Results and Evaluation): The accuracies >0.95 for bucketed scores and >0.79 for trend types are presented without a persistence baseline (predict last observed bucket or trend). Because maintenance activity can be stable over short horizons, a naive baseline may already achieve comparable performance; its absence prevents readers from judging whether the models extract genuine predictive signal.

    Authors: We concur that the absence of a persistence baseline limits interpretability. The revised §5 will include a persistence baseline that simply carries forward the most recent observed bucket (or trend category) for the forecast horizon. We will report all model accuracies alongside this baseline for every training-window / horizon combination, allowing readers to see the incremental value of the learned models.
    revision: yes
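The persistence baseline discussed above can be sketched in a few lines of Python. The function name and the sample bucket sequence are hypothetical, and this is only one reasonable reading of "carry forward the most recent observed bucket", not the authors' evaluation code.

```python
from typing import Sequence


def persistence_accuracy(series: Sequence[str], horizon: int) -> float:
    """Accuracy of predicting the label `horizon` steps ahead as the current label."""
    if horizon < 1 or horizon >= len(series):
        raise ValueError("horizon must be between 1 and len(series) - 1")
    n_forecasts = len(series) - horizon
    hits = sum(series[t] == series[t + horizon] for t in range(n_forecasts))
    return hits / n_forecasts


# Hypothetical monthly bucket sequence for one repository in decline.
buckets = ["high", "high", "high", "moderate", "low", "low", "low", "low"]

acc_1 = persistence_accuracy(buckets, horizon=1)
acc_3 = persistence_accuracy(buckets, horizon=3)
print(f"1-month persistence accuracy: {acc_1:.3f}")
print(f"3-month persistence accuracy: {acc_3:.3f}")
```

Reporting this number next to each model accuracy, per horizon, shows how much of the headline performance comes from label stability alone. On a real dataset the baseline would be averaged over all repositories rather than computed on a single series.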

Circularity Check

0 steps flagged

No significant circularity; empirical forecasting on reconstructed historical data

full rationale

The paper reconstructs historical Maintained scores from GitHub activity over three years and frames forecasting as a standard multivariate time-series task. It trains Random Forest and LSTM models on windows of 3-12 months to predict 1-6 month horizons, reporting accuracies on held-out data for raw scores, bucketed categories, slopes, and trend types. No derivation reduces to its inputs by construction: the accuracies are empirical model performance metrics, not tautological. No self-definitional equations, no fitted parameters renamed as predictions, and no load-bearing self-citations. The chain is data-driven and externally falsifiable against future observed scores.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that past 90-day Maintained scores can be accurately reconstructed from GitHub activity logs and that these reconstructions form a valid multivariate time series. No new physical entities or ad-hoc constants are introduced. Standard ML hyperparameters are present but not load-bearing for the reported accuracies.

axioms (1)
  • Domain assumption: Historical Maintained scores can be reliably reconstructed from GitHub commit and activity data over a three-year window.
    Invoked when framing the problem as a time-series forecasting task on reconstructed scores.

pith-pipeline@v0.9.0 · 5640 in / 1321 out tokens · 51435 ms · 2026-05-16T10:52:07.249707+00:00 · methodology

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages

  1. [1]

    2026. Replication Package. (1 2026). https://figshare.com/s/30f440ade69c5f0b7994

  2. [2]

    Rabe Abdalkareem, Vinicius Oda, Suhaib Mujahid, and Emad Shihab. 2020. On the impact of using trivial packages: An empirical case study on npm and pypi. Empirical Software Engineering 25 (2020), 1168–1204

  3. [3]

    Apache Software Foundation. 2021. Log4j 2. https://logging.apache.org/log4j/2.x/ Accessed: 2026-01-21

  4. [4]

    Houda Bakir, Ghassen Chniti, and Hédi Zaher. 2018. E-Commerce price forecasting using LSTM neural networks. International Journal of Machine Learning and Computing 8, 2 (2018), 169–174

  5. [5]

    Veronika Bauer, Lars Heinemann, and Florian Deissenboeck. 2012. A structured approach to assess third-party library usage. In 2012 28th IEEE International Conference on Software Maintenance (ICSM). IEEE, 483–492

  6. [6]

    Russ Cox. 2019. Surviving software dependencies. Commun. ACM 62, 9 (2019), 36–43

  7. [7]

    Alexandre Decan, Eleni Constantinou, Tom Mens, and Henrique Rocha. 2020. GAP: Forecasting commit activity in git projects. Journal of Systems and Software 165 (2020), 110573

  8. [8]

    Alexandre Decan, Tom Mens, and Maelick Claes. 2016. On the topology of package dependency networks: A comparison of three programming language ecosystems. In Proceedings of the 10th european conference on software architecture workshops. 1–4

  9. [9]

    Alexandre Decan, Tom Mens, and Eleni Constantinou. 2018. On the impact of security vulnerabilities in the npm package dependency network. In Proceedings of the 15th international conference on mining software repositories. 181–191

  10. [10]

    Christof Ebert. 2008. Open source software in industry. IEEE Software 25, 3 (2008), 52–53

  11. [11]

    Nadia Eghbal. 2020. Working in public: the making and maintenance of open source software. Stripe Press

  12. [12]

    GitHub. 2025. Octoverse: A new developer joins GitHub every second as AI leads TypeScript to #1. https://github.blog/news-insights/octoverse/octoverse-a-new-developer-joins-github-every-second-as-ai-leads-typescript-to-1/ Accessed: 2026-01-21

  13. [13]

    Omid Hamidi, Leili Tapak, Hamed Abbasi, and Zohreh Maryanaji. 2018. Application of random forest time series, support vector regression and multivariate adaptive regression splines models in prediction of snowfall (a case study of Alvand in the middle Zagros, Iran). Theoretical and Applied Climatology 134, 3 (2018), 769–776

  14. [14]

    David J Hand, Peter Christen, and Sumayya Ziyad. 2024. Selecting a classification performance measure: matching the measure to the problem. arXiv preprint arXiv:2409.12391 (2024)

  15. [15]

    S Hochreiter. 1997. Long Short-term Memory. Neural Computation MIT-Press (1997)

  16. [16]

    Wenxiang Li and KL Eddie Law. 2024. Deep learning models for time series forecasting: a review. IEEE Access (2024)

  17. [17]

    Massimiliano Marcellino, James H Stock, and Mark W Watson. 2006. A comparison of direct and iterated multistep AR methods for forecasting macroeconomic time series. Journal of econometrics 135, 1-2 (2006), 499–526

  18. [18]

    Kasun Mendis, Manjusri Wickramasinghe, and Pasindu Marasinghe. 2024. Multivariate time series forecasting: A review. In Proceedings of the 2024 2nd Asia Conference on Computer Vision, Image Processing and Pattern Recognition. 1–9

  19. [19]

    Motahare Mounesan, Hossein Siadati, and Sima Jafarikhah. 2023. Exploring the Threat of Software Supply Chain Attacks on Containerized Applications. In 2023 16th International Conference on Security of Information and Networks (SIN). IEEE, 1–8

  20. [20]

    Suhaib Mujahid, Diego Elias Costa, Rabe Abdalkareem, Emad Shihab, Mohamed Aymen Saied, and Bram Adams. 2021. Toward using package centrality trend to identify packages in decline. IEEE Transactions on Engineering Management (2021)

  21. [21]

    Elisa Mussumeci and Flávio Codeço Coelho. 2020. Large-scale multivariate forecasting models for Dengue-LSTM versus random forest regression. Spatial and Spatio-temporal Epidemiology 35 (2020), 100372

  22. [22]

    Igor Nunes, Mike Heddes, Pere Vergés, Danny Abraham, Alexander Veidenbaum, Alexandru Nicolau, and Tony Givargis. [n. d.]. DotHash: Estimating Set Similarity Metrics for Link Prediction and Document Deduplication, May 2023. URL http://arxiv.org/abs/2305.17310 ([n. d.])

  23. [23]

    OpenSSF. 2023. OpenSSF Scorecard. https://github.com/ossf/scorecard. Accessed: 2026-01-21

  24. [24]

    OpenSSF. 2024. XZ Backdoor (CVE-2024-3094). https://openssf.org/blog/2024/03/30/xz-backdoor-cve-2024-3094/. Accessed: 2026-01-21

  25. [25]

    OpenSSF. 2026. Dependents of the ossf/scorecard-action GitHub Action. https://github.com/ossf/scorecard-action/network/dependents?package_id=UGFja2FnZS0yOTQyNTYwNTcz. Accessed: 2026-01-21

  26. [26]

    OpenSSF. 2026. OpenSSF Scorecard GitHub Repository - Prominent Scorecard Users. https://github.com/ossf/scorecard/tree/main?tab=readme-ov-file#prominent-scorecard-users. Accessed: 2026-01-21

  27. [27]

    OpenSSF and LF. 2022. The Open Source Software Security Mobilization Plan. https://openssf.org/oss-security-mobilization-plan/ Accessed: 2026-01-21

  28. [28]

    Mike Pittenger. 2016. Open source security analysis: The state of open source security in commercial applications. Black Duck Software, Tech. Rep (2016)

  29. [29]

    Python Software Foundation. 2026. PyPI JSON API. https://pypi.org/pypi/ <package_name>/json. Accessed: 2026-01-21

  30. [30]

    Python Software Foundation. 2026. PyPI Simple API. https://pypi.org/simple/. Accessed: 2026-01-21

  31. [31]

    Steven Raemaekers, Arie van Deursen, and Joost Visser. 2011. Exploring risks in the usage of third-party libraries. In Proceedings of the BElgian-NEtherlands software eVOLution seminar. 31

  32. [32]

    Kristiina Rahkema and Dietmar Pfahl. 2022. SwiftDependencyChecker: Detecting Vulnerable Dependencies Declared Through CocoaPods, Carthage and Swift PM. In 9th International Conference on Mobile Software Engineering and Systems (MobileSoft). IEEE, 107–111

  33. [33]

    Anton Romanov, Nadezhda Yarushkina, Alexey Filippov, Pavel Sergeev, Ilya Andreev, and Sergey Kiselev. 2023. Time series forecasting during software project state analysis. Mathematics 12, 1 (2023), 47

  34. [34]

    Pablo Romeu, Francisco Zamora-Martínez, Paloma Botella-Rocamora, and Juan Pardo. 2013. Time-series forecasting of indoor temperature using pre-trained deep neural networks. In Artificial Neural Networks and Machine Learning–ICANN 2013: 23rd International Conference on Artificial Neural Networks Sofia, Bulgaria, September 10-13, 2013. Proceedings 23. Sprin...

  35. [35]

    Per Runeson and Martin Höst. 2009. Guidelines for conducting and reporting case study research in software engineering. Empirical software engineering 14, 2 (2009), 131–164

  36. [36]

    Munish Saini and Kuljit Kaur. 2016. Fuzzy analysis and prediction of commit activity in open source software projects. IET Software 10, 5 (2016), 136–146

  37. [37]

    Snyk Security Team. 2022. The Colors and Faker NPM Packages Go Rogue. https://snyk.io/de/blog/open-source-npm-packages-colors-faker/. Accessed: 2026-01-21

  38. [38]

    Evangelos Spiliotis. 2023. Time Series Forecasting with Statistical, Machine Learning, and Deep Learning Methods: Past, Present, and Future. In Forecasting with Artificial Intelligence: Theory and Applications. Springer, 49–75

  39. [39]

    Valentina Tessoni and Michele Amoretti. 2022. Advanced statistical and machine learning methods for multi-step multivariate time series forecasting in predictive maintenance. Procedia Computer Science 200 (2022), 748–757

  40. [40]

    Gonzalo Travieso, Alexandre Benatti, and Luciano da F Costa. 2024. An Analytical Approach to the Jaccard Similarity Index. arXiv preprint arXiv:2410.16436 (2024)

  41. [41]

    Alexandros Tsakpinis. 2023. Analyzing Maintenance Activities of Software Libraries. In Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering. 313–318

  42. [42]

    Alexandros Tsakpinis and Alexander Pretschner. 2024. Analyzing the Accessibility of GitHub Repositories for PyPI and NPM Libraries. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering. 345–350

  43. [43]

    Alexandros Tsakpinis and Alexander Pretschner. 2025. Analyzing the Usage of Donation Platforms for PyPI Libraries. In Proceedings of the 29th International Conference on Evaluation and Assessment in Software Engineering. 628–633

  44. [44]

    Ruturaj K. Vaidya, Lorenzo De Carli, Drew Davidson, and Vaibhav Rastogi. 2021. Security Issues in Language-based Software Ecosystems. arXiv:1903.02613 [cs.CR] https://arxiv.org/abs/1903.02613

  45. [45]

    Yiming Xu, Runzhi He, Hengzhi Ye, Minghui Zhou, and Huaimin Wang. 2025. Predicting Abandonment of Open Source Software Projects with An Integrated Feature Framework. arXiv:2507.21678 [cs.SE] https://arxiv.org/abs/2507.21678

  46. [46]

    Jiaming Yin, Weixiong Rao, Mingxuan Yuan, Jia Zeng, Kai Zhao, Chenxi Zhang, Jiangfeng Li, and Qinpei Zhao. 2019. Experimental study of multivariate time series forecasting models. In Proceedings of the 28th ACM international conference on information and knowledge management. 2833–2839

  47. [47]

    Awad A Younis, Yi Hu, and Ramadan Abdunabi. 2023. Analyzing Software Supply Chain Security Risks in Industrial Control System Protocols: An OpenSSF Scorecard Approach. In 2023 10th International Conference on Dependable Systems and Their Applications (DSA). IEEE, 302–311

  48. [48]

    Nusrat Zahan, Parth Kanakiya, Brian Hambleton, Shohanuzzaman Shohan, and Laurie Williams. 2023. Openssf scorecard: On the path toward ecosystem-wide automated security metrics. IEEE Security & Privacy 21, 6 (2023), 76–88

  49. [49]

    Nusrat Zahan, Shohanuzzaman Shohan, Dan Harris, and Laurie Williams. 2023. Do software security practices yield fewer vulnerabilities?. In 2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, 292–303

  50. [50]

    Nusrat Zahan, Thomas Zimmermann, Patrice Godefroid, Brendan Murphy, Chandra Maddila, and Laurie Williams. 2022. What are weak links in the npm supply chain?. In Proceedings of the 44th International Conference on Software Engineering: Software Engineering in Practice. 331–340