On Using Machine Learning to Identify Knowledge in API Reference Documentation
Pith reviewed 2026-05-24 17:20 UTC · model grok-4.3
The pith
Machine learning can automatically identify specific knowledge types in API reference documentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that conventional machine learning and deep learning classifiers can detect the presence of particular knowledge types from a grounded taxonomy within API reference documentation. When each type is classified separately the best area under the precision-recall curve reaches 87 percent. In the multi-label setting deep learning achieves a macro area under the curve of 79 percent and outperforms both naive baselines and traditional methods. Five of the classifiers generalize from the Java and .NET training data to Python documentation without retraining.
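The two headline metrics can be made concrete with a minimal sketch. It assumes that the area under the precision-recall curve is estimated by step-wise average precision and that the macro figure is an unweighted mean of per-type scores; the paper's exact estimator is not stated here, and the function names below are illustrative only.

```python
from typing import Callable, Dict, Sequence


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise estimate of the area under the precision-recall curve."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positives")
    # Rank items from highest to lowest score (tied scores keep input order).
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at each newly recalled positive
    return total / n_pos


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_average(
    metric: Callable[[Sequence[int], Sequence[float]], float],
    y_true_by_type: Dict[str, Sequence[int]],
    scores_by_type: Dict[str, Sequence[float]],
) -> float:
    """Unweighted mean of a per-type metric, so rare types count as much as common ones."""
    values = [metric(y_true_by_type[t], scores_by_type[t]) for t in y_true_by_type]
    return sum(values) / len(values)
```

Macro averaging is the relevant choice when knowledge types are imbalanced, because a micro average would be dominated by the most frequent types.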
What carries the argument
A collection of binary and multi-label text classifiers (k-nearest neighbors, support vector machines, and deep learning) trained on annotated API documentation to detect each of 12 knowledge types from a grounded taxonomy.
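The "one binary classifier per knowledge type" setup can be sketched with a tiny k-nearest-neighbors example. This is not the authors' pipeline: the bag-of-words features, cosine similarity, the 0.5 threshold, and the five training items are all assumptions made up for illustration, and the SVM and deep learning variants are not shown.

```python
import math
from collections import Counter
from typing import List, Set, Tuple


def vectorize(text: str) -> Counter:
    """Bag-of-words term counts (lowercased, whitespace tokenized)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical annotated items: (documentation text, set of knowledge types).
TRAIN: List[Tuple[str, Set[str]]] = [
    ("Returns the number of elements in this list", {"Functionality"}),
    ("Returns the element at the given index", {"Functionality"}),
    ("Callers must not pass a null comparator", {"Directive"}),
    ("Do not call this method from the event thread", {"Directive"}),
    ("Returns a copy; callers must not modify the original",
     {"Functionality", "Directive"}),
]


def knn_score(query: str, knowledge_type: str, k: int = 3) -> float:
    """Fraction of the k most similar training items that carry the given type."""
    q = vectorize(query)
    ranked = sorted(TRAIN, key=lambda item: cosine(q, vectorize(item[0])), reverse=True)
    top = ranked[:k]
    return sum(knowledge_type in types for _, types in top) / len(top)


def classify(query: str, types=("Functionality", "Directive"), k: int = 3,
             threshold: float = 0.5) -> Set[str]:
    """Run one independent binary decision per knowledge type."""
    return {t for t in types if knn_score(query, t, k) >= threshold}
```

Because each type gets its own decision, an item can receive zero, one, or several types, which is what makes the same scores usable in both the per-type and the multi-label evaluation.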
If this is right
- Tools that automatically tag or surface documentation sections by knowledge type become feasible.
- Hybrid models that combine support vector machines and deep learning can be built to cover all knowledge types more evenly.
- Classifiers for Functionality, Concept, Purpose, Pattern, and Directive can be reused across programming languages.
- Pre-trained embeddings from generic or StackOverflow corpora do not yield measurable gains for this task.
Where Pith is reading between the lines
- The same classifiers could be embedded inside integrated development environments to highlight documentation relevant to the current coding task.
- Classification errors on the existing data could be used to refine or extend the original 12-type taxonomy.
- The approach could be applied to other software texts such as tutorials, forum posts, or commit messages.
- Knowledge types that generalize across languages may reflect universal API concepts while others are tied to particular language ecosystems.
Load-bearing premise
The 5,574 manually annotated Java and .NET documentation items supply accurate ground-truth labels that represent the full range of knowledge types and apply to other languages and APIs.
What would settle it
An independently annotated dataset of API documentation from a third language that shows whether the reported accuracies remain stable or drop sharply.
Original abstract
Using API reference documentation like JavaDoc is an integral part of software development. Previous research introduced a grounded taxonomy that organizes API documentation knowledge in 12 types, including knowledge about the Functionality, Structure, and Quality of an API. We study how well modern text classification approaches can automatically identify documentation containing specific knowledge types. We compared conventional machine learning (k-NN and SVM) and deep learning approaches trained on manually annotated Java and .NET API documentation (n = 5,574). When classifying the knowledge types individually (i.e., multiple binary classifiers) the best AUPRC was up to 87%. The deep learning and SVM classifiers seem complementary. For four knowledge types (Concept, Control, Pattern, and Non-Information), SVM clearly outperforms deep learning which, on the other hand, is more accurate for identifying the remaining types. When considering multiple knowledge types at once (i.e., multi-label classification) deep learning outperforms naïve baselines and traditional machine learning achieving a MacroAUC up to 79%. We also compared classifiers using embeddings pre-trained on generic text corpora and StackOverflow but did not observe significant improvements. Finally, to assess the generalizability of the classifiers, we re-tested them on a different, unseen Python documentation dataset. Classifiers for Functionality, Concept, Purpose, Pattern, and Directive seem to generalize from Java and .NET to Python documentation. The accuracy related to the remaining types seems API-specific. We discuss our results and how they inform the development of tools for supporting developers sharing and accessing API knowledge. Published article: https://doi.org/10.1145/3338906.3338943
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates conventional ML (k-NN, SVM) and deep learning classifiers for identifying 12 knowledge types in API reference documentation. Models are trained on a manually annotated corpus of 5,574 Java and .NET items; performance is reported via per-type AUPRC (up to 87 %) for binary classification and MacroAUC (up to 79 %) for multi-label classification. SVM and DL are shown to be complementary on different types; pre-trained embeddings yield no significant gain; a held-out Python corpus is used to test cross-language generalization, with five types transferring and the rest appearing API-specific.
Significance. If the ground-truth labels are reliable, the work supplies concrete evidence that automated identification of API knowledge types is feasible at useful accuracy levels and that DL and SVM capture complementary signals. The cross-language transfer experiment and the explicit comparison of embedding sources are positive features that strengthen the empirical contribution for tool-building in software engineering.
major comments (2)
- [Section 3] Dataset construction / annotation protocol (Section 3): the manuscript provides no inter-annotator agreement statistic (Cohen’s κ, Fleiss’ κ, or equivalent) nor a description of how conflicts among the 12 taxonomy labels were resolved on the 5,574 items. Because every reported AUPRC and MacroAUC value is computed against these labels, the absence of reliability evidence is load-bearing for the central performance claims.
- [Results section (multi-label table)] Results, multi-label experiment (Table 4 or equivalent): the claim that deep learning “outperforms naïve baselines and traditional machine learning” reaching MacroAUC 79 % is presented without statistical significance tests or confidence intervals on the difference versus SVM. This weakens the comparative conclusion that is used to motivate tool development.
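The agreement statistic requested in the first major comment can be illustrated with a minimal sketch of Cohen's κ for one knowledge type, treating each annotator's judgment as present (1) or absent (0). The second annotator and the function name are hypothetical; the review states that no such statistic was computed for the original labels.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items on which the raters coincide.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labelled at random with their own marginals.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1.0 - expected)
```

Since an item can carry several knowledge types, one plausible protocol is to compute κ separately for each of the 12 types on a re-annotated sample and report all 12 values.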
minor comments (2)
- [Abstract and Results] The abstract states “we did not observe significant improvements” from StackOverflow embeddings but supplies no p-values or effect-size numbers; the corresponding results paragraph should include them.
- [Methods] Feature extraction details (vectorization, hyper-parameter search, class-imbalance handling) are referenced only at high level; a short methods subsection or appendix table would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.
- Referee: [Section 3] Dataset construction / annotation protocol (Section 3): the manuscript provides no inter-annotator agreement statistic (Cohen’s κ, Fleiss’ κ, or equivalent) nor a description of how conflicts among the 12 taxonomy labels were resolved on the 5,574 items. Because every reported AUPRC and MacroAUC value is computed against these labels, the absence of reliability evidence is load-bearing for the central performance claims.
  Authors: We agree that inter-annotator agreement (IAA) statistics strengthen claims about label quality. The 5,574 items were annotated by the first author following the taxonomy validated in prior work, with co-author discussions to resolve ambiguous cases; however, no formal IAA metric was computed at the time. In revision we will expand Section 3 with a detailed annotation protocol description (including conflict resolution via discussion) and explicitly note the absence of IAA as a limitation. Computing full IAA post hoc is not possible without re-annotating a sample, so we treat this as a partial revision.
  revision: partial
- Referee: [Results section (multi-label table)] Results, multi-label experiment (Table 4 or equivalent): the claim that deep learning “outperforms naïve baselines and traditional machine learning” reaching MacroAUC 79 % is presented without statistical significance tests or confidence intervals on the difference versus SVM. This weakens the comparative conclusion that is used to motivate tool development.
  Authors: We agree that statistical support for the DL vs. SVM comparison would strengthen the multi-label results. In the revised manuscript we will add bootstrap confidence intervals around the MacroAUC values and include a paired statistical test (e.g., McNemar’s test on per-document predictions or a bootstrap test on the AUC difference) to evaluate whether the observed advantage of deep learning over SVM is statistically significant.
  revision: yes
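The bootstrap comparison mentioned in the second response could look roughly like the sketch below. It is a simplification under stated assumptions: it resamples test documents with replacement, scores a single knowledge type with ROC AUC, and reports a percentile interval for the difference between two models. The function names, the default of 2,000 resamples, and the fixed seed are illustrative choices, not details taken from the paper or the rebuttal.

```python
import random
from typing import List, Sequence, Tuple


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Chance that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_ci(
    y_true: Sequence[int],
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile interval for AUC(model A) - AUC(model B) on the same documents."""
    if len(set(y_true)) < 2:
        raise ValueError("need both positive and negative documents")
    rng = random.Random(seed)
    n = len(y_true)
    diffs: List[float] = []
    while len(diffs) < n_boot:
        # Paired resampling: the same document indices are used for both models.
        idx = [rng.randrange(n) for _ in range(n)]
        y = [y_true[i] for i in idx]
        if len(set(y)) < 2:
            continue  # AUC is undefined without both classes, so redraw
        auc_a = roc_auc(y, [scores_a[i] for i in idx])
        auc_b = roc_auc(y, [scores_b[i] for i in idx])
        diffs.append(auc_a - auc_b)
    diffs.sort()
    lower = diffs[round((alpha / 2) * n_boot)]
    upper = diffs[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

An interval that excludes zero would support the claim that one model is better on that type; extending the sketch to MacroAUC would mean averaging the per-type AUC values inside each resample before taking the difference.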
Circularity Check
No circularity: empirical ML performance on held-out annotations
The paper trains standard classifiers (k-NN, SVM, deep learning) on a manually annotated corpus of 5,574 items and reports direct performance metrics (AUPRC, MacroAUC) on held-out test data plus a separate Python transfer set. No equations, parameter fits presented as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation chain. All headline numbers are computed against external ground-truth labels rather than being forced by the model's own structure or prior author results. This is a standard empirical evaluation whose validity rests on annotation quality, not on any internal reduction to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the 12 knowledge types identified in prior work form a complete taxonomy suitable for supervised classification of API documentation.