What Do Developers Ask About ML Libraries? A Large-scale Study Using Stack Overflow
Pith reviewed 2026-05-25 14:13 UTC · model grok-4.3
The pith
Analysis of 3,243 highly-rated Stack Overflow posts on ten ML libraries finds that static and dynamic analyses are mostly absent and that API misuses are prevalent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our findings reveal the urgent need for software engineering research in this area. Both static and dynamic analyses are mostly absent and badly needed to help developers find errors earlier. API misuses are prevalent and API design improvements are sorely needed. Last and somewhat surprisingly, a tug of war between providing higher levels of abstractions and the need to understand the behavior of the trained model is prevalent.
What carries the argument
Manual classification of questions into seven stages of an ML pipeline followed by statistical analysis across libraries and time periods.
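A minimal sketch of how labelled posts could be cross-tabulated by library and stage and then tested for association. The chi-square test of independence is one common choice and is an assumption here, since this summary does not say which tests the paper applies. The sample labels, the stage names, and the helper names `contingency_table` and `chi_square_statistic` are hypothetical.

```python
from collections import Counter

# Hypothetical labelled posts as (library, pipeline stage) pairs.
# The counts and stage names are illustrative, not taken from the paper.
labelled = (
    [("tensorflow", "data preparation")] * 20
    + [("tensorflow", "model training")] * 10
    + [("weka", "data preparation")] * 10
    + [("weka", "model training")] * 20
)


def contingency_table(pairs):
    """Build a library x stage table of post counts."""
    counts = Counter(pairs)
    rows = sorted({library for library, _ in pairs})
    cols = sorted({stage for _, stage in pairs})
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table


def chi_square_statistic(table):
    """Return the chi-square statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if library and stage were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


libraries, stages, table = contingency_table(labelled)
stat, dof = chi_square_statistic(table)
print(libraries, stages, table, round(stat, 3), dof)
```

A large statistic relative to the chi-square distribution with `dof` degrees of freedom would suggest that the stage distribution differs between libraries; a library such as SciPy could supply the p-value.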
If this is right
- Static and dynamic analysis techniques must be developed specifically for ML library usage to catch errors before runtime.
- Debugging support for ML systems requires substantially more research attention.
- Redesign of ML library APIs is needed to reduce the rate of misuses observed in the questions.
- Approaches that reconcile high-level abstractions with visibility into trained model internals should be explored.
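To make the first implication concrete, here is a minimal sketch of a static check that flags one well-known misuse pattern before any code runs: fitting an estimator or transformer on data whose variable name suggests a test set. The pattern, the name-based heuristic, and the function `find_fit_on_test_data` are assumptions chosen for illustration; they are not taken from the paper.

```python
import ast

# Method names treated as "fitting" calls. This set is an illustrative assumption.
FIT_METHODS = {"fit", "fit_transform"}


def find_fit_on_test_data(source):
    """Return (line, method, argument) for fit-like calls on test-named variables."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        if not (isinstance(func, ast.Attribute) and func.attr in FIT_METHODS):
            continue
        for arg in node.args:
            # Heuristic: a positional argument whose name contains "test".
            if isinstance(arg, ast.Name) and "test" in arg.id.lower():
                findings.append((node.lineno, func.attr, arg.id))
    return findings


# The snippet is only parsed, never executed, so undefined names are fine.
snippet = """
scaler.fit(X_train)
X_test_scaled = scaler.fit_transform(X_test)
"""
findings = find_fit_on_test_data(snippet)
print(findings)
```

A name-based heuristic like this produces false positives and misses renamed variables; a production analysis would need data-flow information rather than identifier matching.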
Where Pith is reading between the lines
- Library maintainers could instrument their code with additional checks at the training and evaluation stages where questions cluster.
- The observed tension between abstraction and model transparency may affect adoption rates of newer high-level ML frameworks.
- Educational materials and documentation for ML libraries should prioritize the pipeline stages that generate the most questions.
Load-bearing premise
The 3,243 highly-rated Q&A posts selected from Stack Overflow are representative of the difficulties faced by software developers when learning about and using ML libraries in their systems.
What would settle it
A follow-up survey or interview study of practicing ML developers that finds their most common problems do not match the distribution of stages and error types identified in the Stack Overflow posts.
Original abstract
Modern software systems are increasingly including machine learning (ML) as an integral component. However, we do not yet understand the difficulties faced by software developers when learning about ML libraries and using them within their systems. To that end, this work reports on a detailed (manual) examination of 3,243 highly-rated Q&A posts related to ten ML libraries, namely Tensorflow, Keras, scikit-learn, Weka, Caffe, Theano, MLlib, Torch, Mahout, and H2O, on Stack Overflow, a popular online technical Q&A forum. We classify these questions into seven typical stages of an ML pipeline to understand the correlation between the library and the stage. Then we study the questions and perform statistical analysis to explore the answer to four research objectives (finding the most difficult stage, understanding the nature of problems, nature of libraries and studying whether the difficulties stayed consistent over time). Our findings reveal the urgent need for software engineering (SE) research in this area. Both static and dynamic analyses are mostly absent and badly needed to help developers find errors earlier. While there has been some early research on debugging, much more work is needed. API misuses are prevalent and API design improvements are sorely needed. Last and somewhat surprisingly, a tug of war between providing higher levels of abstractions and the need to understand the behavior of the trained model is prevalent.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports a manual examination and statistical analysis of 3,243 highly-rated Stack Overflow Q&A posts across ten ML libraries (TensorFlow, Keras, scikit-learn, etc.). Posts are classified into seven stages of a typical ML pipeline; the authors then address four research objectives on the most difficult stage, nature of problems, library differences, and temporal stability, concluding that static/dynamic analyses are absent, API misuses are prevalent, and API design improvements plus further SE research are urgently needed.
Significance. If the classification process proves reliable and the highly-rated SO sample is representative, the work supplies concrete evidence of tooling gaps at the SE-ML boundary and could usefully guide priorities for static analysis, debugging support, and API usability research. The multi-library scope and pipeline-stage framing are strengths that would make the findings actionable for both researchers and library maintainers.
major comments (2)
- [Methodology: data collection and classification] The abstract and text describe manual examination and assignment to seven pipeline stages but supply no information on inter-rater reliability, how the seven stages themselves were validated or pilot-tested, or the precise exclusion criteria applied to arrive at the final 3,243 posts. These omissions directly affect the soundness of every subsequent statistical claim and the identification of 'most difficult' stages.
- [Results and Discussion] The central extrapolation that 'both static and dynamic analyses are mostly absent and badly needed' and that 'API misuses are prevalent' rests on the assumption that the selected highly-rated SO posts represent the difficulties faced by developers in general. No cross-validation against GitHub issues, developer surveys, or usage telemetry is reported, leaving the generalization load-bearing for the 'urgent need' conclusion.
minor comments (2)
- [Abstract] Abstract: the phrase 'highly-rated' is used without stating the exact rating threshold or vote count applied during selection.
- [Introduction / Research Objectives] The description of the four research objectives would benefit from explicit mapping to the statistical tests or tables that address each one.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments below, indicating planned revisions where appropriate.
- Referee: Methodology section (data collection and classification): the abstract and text describe manual examination and assignment to seven pipeline stages but supply no information on inter-rater reliability, how the seven stages themselves were validated or pilot-tested, or the precise exclusion criteria applied to arrive at the final 3,243 posts. These omissions directly affect the soundness of every subsequent statistical claim and the identification of 'most difficult' stages.
Authors: We agree that the methodology section would benefit from greater transparency. The seven pipeline stages were derived from standard descriptions in the ML and SE literature (e.g., data preparation, model training, evaluation). Posts were included if they were tagged with one of the ten libraries, had an accepted answer, and met a minimum score threshold chosen to focus on highly-rated content; non-English posts and duplicates were excluded. Classification was performed by the first two authors, with disagreements resolved through discussion until consensus was reached. No formal inter-rater reliability statistic (e.g., Cohen's kappa) was computed. We will expand the methodology section with explicit stage definitions, a description of the pilot phase used to refine the stages, the exact inclusion and exclusion rules, and the consensus process.
Revision: yes
- Referee: Results and Discussion sections: the central extrapolation that 'both static and dynamic analyses are mostly absent and badly needed' and that 'API misuses are prevalent' rests on the assumption that the selected highly-rated SO posts represent the difficulties faced by developers in general. No cross-validation against GitHub issues, developer surveys, or usage telemetry is reported, leaving the generalization load-bearing for the 'urgent need' conclusion.
Authors: The study is explicitly scoped to highly-rated Stack Overflow posts, which serve as a public record of developer difficulties that have been vetted by the community through votes and answers. This source is commonly used in empirical SE research on API usage and learning barriers. We acknowledge that the absence of triangulation with GitHub issues or surveys limits the strength of broad generalizations. We will revise the discussion and threats-to-validity sections to (a) more precisely bound the claims to the SO dataset and (b) explicitly call for future multi-source validation studies.
Revision: partial
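The first response notes that no inter-rater reliability statistic such as Cohen's kappa was computed. As a minimal sketch of what that check involves, the function below computes kappa for two raters who each assign one label per post. The function name `cohens_kappa`, the sample labels, and the stage names are hypothetical and used only for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical stage labels from two raters for four posts.
rater_a = ["training", "training", "evaluation", "evaluation"]
rater_b = ["training", "training", "evaluation", "training"]
kappa = cohens_kappa(rater_a, rater_b)
print(kappa)
```

For these sample labels the raters agree on three of four posts (0.75) against a chance agreement of 0.5, giving a kappa of 0.5. Kappa must be computed on the independent labels, before any disagreements are resolved by discussion.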
Circularity Check
No circularity: purely observational empirical classification study
The paper conducts a manual examination and classification of 3,243 Stack Overflow posts into ML pipeline stages, followed by statistical analysis of observed patterns (e.g., prevalent API misuses). No equations, fitted parameters renamed as predictions, self-citation chains, uniqueness theorems, or ansatzes are present. All claims derive directly from the selected data without reduction to inputs by construction. Representativeness concerns affect external validity but do not create circularity in the reported derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the seven typical stages of an ML pipeline form a valid and exhaustive categorization for classifying developer questions.