Patterns of Effort Contribution and Demand and User Classification based on Participation Patterns in NPM Ecosystem
Pith reviewed 2026-05-24 21:28 UTC · model grok-4.3
The pith
Developers in NPM primarily contribute to and demand effort from direct dependencies rather than transitive ones
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Users contribute and demand effort primarily from packages that they depend on directly with only a tiny fraction of contributions and demand going to transitive dependencies. A significant portion of demand goes into packages outside the users' respective supply chains (constructed based on publicly visible version control data). Three and two different groups of users are observed based on the effort demand and effort contribution patterns respectively. The Random Forest model used for identifying the company affiliation of the users gives a AUC-ROC value of 0.68.
What carries the argument
Fuzzy c-means clustering applied to per-user effort contribution and demand vectors built from issues, pull requests, and commit activity linked to direct and transitive NPM dependencies.
If this is right
- Most maintainer effort can focus on direct rather than transitive relationships.
- A meaningful share of demand originates outside any single user's visible supply chain and may require separate support mechanisms.
- User groups defined by demand versus contribution patterns may need distinct engagement or tooling approaches.
- Participation patterns carry a detectable signal for inferring commercial versus volunteer status.
- Increasing upstream visibility is needed to better align demand with supply across the ecosystem.
Where Pith is reading between the lines
- The same clustering and prediction approach could be tested on other package registries to check whether direct-dependency concentration is general.
- Incomplete public supply chains may systematically understate the true reach of demand, suggesting a need for hybrid public-private dependency mapping.
- If company affiliation can be inferred from patterns, ecosystems might use similar signals to measure commercial involvement without self-reporting.
- The existence of demand outside supply chains implies potential sustainability risks for packages that receive effort without reciprocal contribution links.
Load-bearing premise
The supply chains constructed from public version control data accurately reflect real dependencies and the public commit records of issue creators capture their full participation without substantial private or missing activity.
What would settle it
A dataset that includes private repositories and finds large fractions of contributions or demands going to transitive dependencies would falsify the concentration on direct dependencies.
Figures
read the original abstract
Background: Open source requires participation of volunteer and commercial developers (users) in order to deliver functional high-quality components. Developers both contribute effort in the form of patches and demand effort from the component maintainers to resolve issues reported against it. Aim: Identify and characterize patterns of effort contribution and demand throughout the open source supply chain and investigate if and how these patterns vary with developer activity; identify different groups of developers; and predict developers' company affiliation based on their participation patterns. Method: 1,376,946 issues and pull-requests created for 4433 NPM packages with over 10,000 monthly downloads and full (public) commit activity data of the 272,142 issue creators is obtained and analyzed and dependencies on NPM packages are identified. Fuzzy c-means clustering algorithm is used to find the groups among the users based on their effort contribution and demand patterns, and Random Forest is used as the predictive modeling technique to identify their company affiliations. Result: Users contribute and demand effort primarily from packages that they depend on directly with only a tiny fraction of contributions and demand going to transitive dependencies. A significant portion of demand goes into packages outside the users' respective supply chains (constructed based on publicly visible version control data). Three and two different groups of users are observed based on the effort demand and effort contribution patterns respectively. The Random Forest model used for identifying the company affiliation of the users gives a AUC-ROC value of 0.68. Conclusion: Our results give new insights into effort demand and supply at different parts of the supply chain of the NPM ecosystem and its users and suggests the need to increase visibility further upstream.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes patterns of effort contribution (via commits) and demand (via issues/PRs) in the NPM ecosystem using data on 1,376,946 issues/PRs across 4433 high-download packages and full public commit histories for their 272,142 creators. It reports that both contribution and demand are overwhelmingly directed at direct dependencies (with negligible shares to transitive dependencies), that a substantial fraction of demand targets packages outside each user's inferred supply chain, that fuzzy c-means clustering identifies three distinct demand-pattern groups and two contribution-pattern groups, and that a Random Forest classifier predicts company affiliation from these patterns with AUC-ROC 0.68.
Significance. If the supply-chain inferences hold, the work supplies large-scale empirical evidence on how effort flows through real OSS dependency graphs and documents heterogeneity in developer behavior, which could inform maintenance prioritization and tooling for upstream visibility. The dataset scale (hundreds of thousands of users and over a million events) is a clear strength; the modest classifier performance and the reliance on public data alone are acknowledged limitations in the text.
major comments (2)
- [Method] Method section (dependency and supply-chain construction): the headline claim that a significant portion of demand falls outside users' supply chains rests on the accuracy of inferring each of the 272k users' direct and transitive dependencies solely from public commit histories and the dependency graph among the 4433 packages. The description does not specify how exact versions or version ranges are resolved, whether packages outside the 4433-set are considered, or how private/unlisted dependencies are handled; any systematic under-approximation would inflate the 'outside' bucket by construction and directly undermine the central supply-chain result.
- [Results] Results (clustering): the reported three demand groups and two contribution groups are obtained via fuzzy c-means with the number of clusters treated as a free parameter; no cluster-validity indices, stability analysis across random seeds, or sensitivity checks are described, yet these groupings are presented as a primary finding characterizing user heterogeneity.
minor comments (2)
- [Abstract / Method] The abstract and Method section would benefit from one additional sentence on the exact procedure used to map issues to packages and to build per-user supply chains (e.g., whether dependency metadata was taken from package.json at the time of the issue or from the latest version).
- [Results] Table or figure presenting the cluster centroids or feature distributions would make the 'three and two groups' claim easier to interpret without requiring the reader to reconstruct the patterns from text alone.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight areas where additional methodological detail and validation will strengthen the paper. We address each major comment below and commit to revisions that directly respond to the concerns raised.
read point-by-point responses
-
Referee: [Method] Method section (dependency and supply-chain construction): the headline claim that a significant portion of demand falls outside users' supply chains rests on the accuracy of inferring each of the 272k users' direct and transitive dependencies solely from public commit histories and the dependency graph among the 4433 packages. The description does not specify how exact versions or version ranges are resolved, whether packages outside the 4433-set are considered, or how private/unlisted dependencies are handled; any systematic under-approximation would inflate the 'outside' bucket by construction and directly undermine the central supply-chain result.
Authors: We agree that the current method description is insufficient to allow full assessment of the supply-chain inferences. In the revised manuscript we will add a dedicated subsection detailing the dependency extraction process: parsing of package.json files from the public commit histories of the 272,142 users, resolution of version ranges against the NPM registry snapshot available at data collection time, explicit restriction to the 4433-package dependency graph, and a clear statement of the limitation that private or unlisted dependencies cannot be observed. This expanded description will also discuss the potential for under-approximation and its implications for the 'outside supply chain' category, allowing readers to evaluate the robustness of the central result. revision: yes
-
Referee: [Results] Results (clustering): the reported three demand groups and two contribution groups are obtained via fuzzy c-means with the number of clusters treated as a free parameter; no cluster-validity indices, stability analysis across random seeds, or sensitivity checks are described, yet these groupings are presented as a primary finding characterizing user heterogeneity.
Authors: We accept that the clustering results require additional validation to be presented as robust characterizations of user heterogeneity. The revised version will include (i) computation of standard fuzzy cluster validity indices (Xie-Beni and fuzzy silhouette) to support the choice of three demand clusters and two contribution clusters, (ii) stability analysis by repeating fuzzy c-means across multiple random seeds and reporting variation in cluster assignments, and (iii) sensitivity checks on the fuzziness parameter m. These additions will be placed in the results section alongside the existing group descriptions. revision: yes
Circularity Check
No circularity in empirical data analysis
full rationale
The paper collects public NPM issue/PR and commit data for 4433 packages and 272k users, constructs supply chains from visible dependencies, applies fuzzy c-means clustering to identify user groups, and trains a Random Forest to predict company affiliation (AUC-ROC 0.68). No equations, derivations, or self-citations reduce any claim to its inputs by construction. All results are direct empirical outputs from standard statistical and ML methods applied to observed data; the supply-chain step is a one-time methodological preprocessing step whose outputs are then measured, not redefined. This is self-contained observational analysis with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- Number of clusters (demand) =
3
- Number of clusters (contribution) =
2
axioms (1)
- domain assumption Public commit activity data accurately represents user participation and supply chain dependencies.
Reference graph
Works this paper leans on
-
[1]
Christopher J Alberts, Audrey J Dorofee, Rita Creel, Robert J Ellison, and Carol Woody. 2011. A systemic approach for assessing software supply-chain risk. In 2011 44th Hawaii International Conference on System Sciences . IEEE, 1–8
work page 2011
-
[2]
Sadika Amreen, Bogdan Bichescu, Randy Bradley, Tapajit Dey, Yuxing Ma, Audris Mockus, Sara Mousavi, and Russell Zaretzki. 2019. A Methodology for Measuring FLOSS Ecosystems. In Towards Engineering Free/Libre Open Source Software (FLOSS) Ecosystems for Impact and Sustainability . Springer, Singapore, 1–29
work page 2019
-
[3]
James C Bezdek, Robert Ehrlich, and William Full. 1984. FCM: The fuzzy c-means clustering algorithm. Computers & Geosciences 10, 2-3 (1984), 191–203
work page 1984
-
[4]
Barry W. Boehm. 1991. Software risk management: principles and practices.IEEE software 8, 1 (1991), 32–41
work page 1991
-
[5]
Christopher Bogart, Christian Kästner, James Herbsleb, and Ferdian Thung. 2016. How to break an API: Cost negotiation and community values in three software ecosystems. In Proceedings of the 2016 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering . ACM, 109–120
work page 2016
-
[6]
Gerardo Canfora, Luigi Cerulo, Marta Cimitile, and Massimiliano Di Penta. 2011. Social interactions around cross-system bug fixings: the case of FreeBSD and OpenBSD. In Proceedings of the 8th working conference on mining software reposi- tories. ACM, 143–152
work page 2011
-
[7]
Patrick YK Chau and Kar Yan Tam. 1997. Factors affecting the adoption of open systems: an exploratory study. MIS quarterly (1997), 1–24
work page 1997
-
[8]
Malgorzata Ciesielska and Ann Westenholz. 2016. Dilemmas within commer- cial involvement in open source software. Journal of Organizational Change Management 29, 3 (2016), 344–360
work page 2016
-
[9]
Kevin Crowston and James Howison. 2003. The social structure of open source software development teams. (2003)
work page 2003
-
[10]
Alexandre Decan, Tom Mens, and Maëlick Claes. 2017. An empirical comparison of dependency issues in OSS packaging ecosystems. In2017 IEEE 24th International Conference on Software Analysis, Evolution and Reengineering (SANER) . IEEE, 2– 12
work page 2017
-
[11]
Alexandre Decan, Tom Mens, Maëlick Claes, and Philippe Grosjean. 2016. When GitHub meets CRAN: An analysis of inter-repository package dependency prob- lems. In 2016 IEEE 23rd International Conference on Software Analysis, Evolution, and Reengineering (SANER), Vol. 1. IEEE, 493–504
work page 2016
-
[12]
Alexandre Decan, Tom Mens, and Eleni Constantinou. 2018. On the impact of security vulnerabilities in the npm package dependency network. In 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR) . IEEE, 181–191
work page 2018
-
[13]
Tapajit Dey and Audris Mockus. 2018. Are software dependency supply chain metrics useful in predicting change of popularity of npm packages?. InProceedings of the 14th International Conference on Predictive Models and Data Analytics in Software Engineering. ACM, 66–69
work page 2018
-
[14]
Tapajit Dey and Audris Mockus. 2018. Modeling Relationship between Post- Release Faults and Usage in Mobile Software. In Proceedings of the 14th Interna- tional Conference on Predictive Models and Data Analytics in Software Engineering . ACM, 56–65
work page 2018
-
[15]
Hui Ding, Wanwangying Ma, Lin Chen, Yuming Zhou, and Baowen Xu. 2017. An empirical study on downstream workarounds for cross-project bugs. In 2017 24th Asia-Pacific Software Engineering Conference (APSEC) . IEEE, 318–327
work page 2017
-
[16]
Nicolas Ducheneaut. 2005. Socialization in an open source software community: A socio-technical analysis. Computer Supported Cooperative Work (CSCW) 14, 4 (2005), 323–368
work page 2005
-
[17]
Eugene Glynn, Brian Fitzgerald, and Chris Exton. 2005. Commercial adoption of open source software: an empirical study. In 2005 International Symposium on Empirical Software Engineering, 2005. IEEE, 10–pp
work page 2005
-
[18]
Georgios Gousios. 2013. The GHTorrent dataset and tool suite. In Proceedings of the 10th Working Conference on Mining Software Repositories (MSR ’13) . IEEE Press, Piscataway, NJ, USA, 233–236. http://dl.acm.org/citation.cfm?id=2487085. 2487132
work page 2013
-
[19]
Karim R Lakhani and Eric Von Hippel. 2004. How open source software works:âĂIJfreeâĂİ user-to-user assistance. In Produktentwicklung mit virtuellen Communities. Springer, 303–339
work page 2004
-
[20]
Amanda Lee and Jeffrey C Carver. 2017. Are one-time contributors different? a comparison to core and periphery developers in floss repositories. In 2017 ACM/IEEE International Symposium on Empirical Software Engineering and Mea- surement (ESEM). IEEE, 1–10
work page 2017
-
[21]
Amanda Lee, Jeffrey C Carver, and Amiangshu Bosu. 2017. Understanding the impressions, motivations, and barriers of one time code contributors to FLOSS projects: a survey. In Proceedings of the 39th International Conference on Software Engineering. IEEE Press, 187–197
work page 2017
-
[22]
Wanwangying Ma, Lin Chen, Xiangyu Zhang, Yuming Zhou, and Baowen Xu
-
[23]
In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE)
How do developers fix cross-project correlated bugs? a case study on the GitHub scientific Python ecosystem. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE) . IEEE, 381–392
work page 2017
-
[24]
Yuxing Ma, Chris Bogart, Sadika Amreen, Russell Zaretzki, and Audris Mockus
-
[25]
In IEEE Working Conference on Mining Software Repositories
World of Code: An Infrastructure for Mining the Universe of Open Source VCS Data. In IEEE Working Conference on Mining Software Repositories . papers/ WoC.pdf
-
[26]
Audris Mockus, Roy T Fielding, and James D Herbsleb. 2002. Two case studies of open source software development: Apache and Mozilla. ACM Transactions on Software Engineering and Methodology (TOSEM) 11, 3 (2002), 309–346
work page 2002
-
[27]
Peter C Rigby, Yue Cai Zhu, Samuel M Donadelli, and Audris Mockus. 2016. Quantifying and mitigating turnover-induced knowledge loss: case studies of Chrome and a project at Avaya. In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE) . IEEE, 1006–1016
work page 2016
-
[28]
Marat Valiev, Bogdan Vasilescu, and James Herbsleb. 2018. Ecosystem-level determinants of sustained activity in open-source projects: a case study of the pypi ecosystem. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering. ACM, 644–655
work page 2018
-
[29]
Eric Von Hippel. 2001. Learning from open-source software. MIT Sloan manage- ment review 42, 4 (2001), 82–86
work page 2001
- [30]
-
[31]
Patrick Wagstrom, Corey Jergensen, and Anita Sarma. 2012. Roles in a networked software development ecosystem: A case study in GitHub. (2012)
work page 2012
-
[32]
Linda Wallace, Mark Keil, and Arun Rai. 2004. Understanding software project risk: a cluster analysis. Information & management 42, 1 (2004), 115–125
work page 2004
-
[33]
Erik Wittern, Philippe Suter, and Shriram Rajagopalan. 2016. A look at the dynamics of the JavaScript package ecosystem. In Mining Software Repositories (MSR), 2016 IEEE/ACM 13th Working Conference on . IEEE, 351–361
work page 2016
-
[34]
Jialiang Xie, Minghui Zhou, and Audris Mockus. 2013. Impact of triage: a study of mozilla and gnome. In 2013 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement . IEEE, 247–250
work page 2013
-
[35]
Rodrigo Elizalde Zapata, Raula Gaikovina Kula, Bodin Chinthanet, Takashi Ishio, Kenichi Matsumoto, and Akinori Ihara. 2018. Towards smoother library migra- tions: A look at vulnerable dependency migrations at function level for npm JavaScript packages. In 2018 IEEE International Conference on Software Mainte- nance and Evolution (ICSME) . IEEE, 559–563
work page 2018
-
[36]
Ahmed Zerouali, Eleni Constantinou, Tom Mens, Gregorio Robles, and Jesús González-Barahona. 2018. An empirical analysis of technical lag in npm package dependencies. In International Conference on Software Reuse . Springer, 95–110
work page 2018
-
[37]
Minghui Zhou, Audris Mockus, Xiujuan Ma, Lu Zhang, and Hong Mei. 2016. Inflow and retention in oss communities with commercial involvement: A case study of three hybrid projects. ACM Transactions on Software Engineering and Methodology (TOSEM) 25, 2 (2016), 13
work page 2016
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.