arxiv: 1907.09807 · v1 · pith:KPEFQFBNnew · submitted 2019-07-23 · 💻 cs.SE

On Using Machine Learning to Identify Knowledge in API Reference Documentation

Pith reviewed 2026-05-24 17:20 UTC · model grok-4.3

classification 💻 cs.SE
keywords API documentation · machine learning · text classification · knowledge types · deep learning · multi-label classification · software engineering

The pith

Machine learning can automatically identify specific knowledge types in API reference documentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether modern text classification methods can detect which of 12 knowledge types appear in API reference documentation. It trains models on a set of 5,574 manually labeled Java and .NET examples and measures performance both for single types and for combinations of types. Deep learning reaches the highest scores in the multi-label setting while support vector machines perform better on some individual types. Several of the resulting classifiers also work on an unseen Python documentation set. The work explores how such classification could help build tools that let developers find needed information more quickly in dense reference material.

Core claim

The authors establish that conventional machine learning and deep learning classifiers can detect the presence of particular knowledge types from a grounded taxonomy within API reference documentation. When each type is classified separately the best area under the precision-recall curve reaches 87 percent. In the multi-label setting deep learning achieves a macro area under the curve of 79 percent and outperforms both naive baselines and traditional methods. Five of the classifiers generalize from the Java and .NET training data to Python documentation without retraining.
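
As a rough illustration of the two headline metrics, the sketch below computes average precision (one common step-wise estimator of the area under the precision-recall curve) for each knowledge type and then macro-averages across types. The toy labels and scores are invented for illustration, tied scores get no special handling, and the paper's exact AUPRC and MacroAUC estimators may differ from this one.

```python
from typing import Dict, Sequence


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise estimate of the area under the precision-recall curve."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("average precision is undefined without positive examples")
    # Rank documents from the highest to the lowest classifier score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the rank of this positive
    return total / n_pos


def macro_average(per_label: Dict[str, float]) -> float:
    """Unweighted mean over labels, so rare types count as much as common ones."""
    return sum(per_label.values()) / len(per_label)


# Hypothetical ground truth and scores for two of the knowledge types.
toy = {
    "Functionality": ([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.3, 0.2]),
    "Directive": ([0, 1, 0, 0, 1], [0.1, 0.9, 0.4, 0.3, 0.8]),
}
per_label_ap = {name: average_precision(y, s) for name, (y, s) in toy.items()}
macro_ap = macro_average(per_label_ap)
print(per_label_ap, macro_ap)
```

Macro-averaging treats every knowledge type equally, which matters here because the types are unevenly distributed in the data.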

What carries the argument

A collection of binary and multi-label text classifiers (k-nearest neighbors, support vector machines, and deep learning) trained on annotated API documentation to detect each of 12 knowledge types from a grounded taxonomy.
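
The difference between the two settings (one binary classifier per type versus a single multi-label classifier) comes down to how the annotations are encoded. The sketch below shows both encodings for a few invented documents. The list of types contains only the names mentioned on this page, not the full 12-type taxonomy, and the helper names are hypothetical.

```python
from typing import Dict, List, Set

# Only the type names that appear on this page; the full taxonomy has 12 types.
KNOWN_TYPES = [
    "Functionality", "Structure", "Quality", "Concept", "Control",
    "Pattern", "Non-Information", "Purpose", "Directive",
]


def to_binary_targets(label_sets: List[Set[str]], types: List[str]) -> Dict[str, List[int]]:
    """Binary setting: one independent 0/1 target vector per knowledge type."""
    return {t: [1 if t in labels else 0 for labels in label_sets] for t in types}


def to_indicator_matrix(label_sets: List[Set[str]], types: List[str]) -> List[List[int]]:
    """Multi-label setting: one row per document, one column per knowledge type."""
    return [[1 if t in labels else 0 for t in types] for labels in label_sets]


# Invented annotations for three documentation items.
docs = [
    {"Functionality", "Directive"},
    {"Concept"},
    {"Functionality", "Structure", "Pattern"},
]
binary_targets = to_binary_targets(docs, KNOWN_TYPES)
indicator = to_indicator_matrix(docs, KNOWN_TYPES)
print(binary_targets["Functionality"])
print(indicator[0])
```

Each entry of `binary_targets` would train a separate classifier, while `indicator` is the target a single multi-label model would predict in one pass.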

If this is right

  • Tools that automatically tag or surface documentation sections by knowledge type become feasible.
  • Hybrid models that combine support vector machines and deep learning can be built to cover all knowledge types more evenly.
  • Classifiers for Functionality, Concept, Purpose, Pattern, and Directive can be reused across programming languages.
  • In the authors' experiments, embeddings pre-trained on generic text or StackOverflow corpora yield no significant improvement for this task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same classifiers could be embedded inside integrated development environments to highlight documentation relevant to the current coding task.
  • Classification errors on the existing data could be used to refine or extend the original 12-type taxonomy.
  • The approach could be applied to other software texts such as tutorials, forum posts, or commit messages.
  • Knowledge types that generalize across languages may reflect universal API concepts while others are tied to particular language ecosystems.

Load-bearing premise

The 5,574 manually annotated Java and .NET documentation items supply accurate ground-truth labels that represent the full range of knowledge types and apply to other languages and APIs.

What would settle it

An independently annotated dataset of API documentation from a third language that shows whether the reported accuracies remain stable or drop sharply.

Figures

Figures reproduced from arXiv: 1907.09807 by Alireza Mollaalizadehbahnemiri, Davide Fucci, Walid Maalej.

Figure 1: A reference documentation page in the JDK API annotated with the knowledge types it contains.
Figure 2: Knowledge types distribution in the CADO dataset after resampling.
Figure 3: Knowledge types distribution in the PYTHON dataset. Two Ph.D. students in software engineering, accustomed to working with Python, manually labelled the knowledge types in each document. For this task, we provided them the same guidelines from Maalej and Robillard [3] with small adaptations, such as providing examples using the Python programming language. The agreement on the label set was 14%, i.e., 14 out of t…
Figure 4: Architecture of the RNN used for classification of the knowledge types.
Figure 5: A single LSTM recurrent module containing input (…
Original abstract

Using API reference documentation like JavaDoc is an integral part of software development. Previous research introduced a grounded taxonomy that organizes API documentation knowledge in 12 types, including knowledge about the Functionality, Structure, and Quality of an API. We study how well modern text classification approaches can automatically identify documentation containing specific knowledge types. We compared conventional machine learning (k-NN and SVM) and deep learning approaches trained on manually annotated Java and .NET API documentation (n = 5,574). When classifying the knowledge types individually (i.e., multiple binary classifiers) the best AUPRC was up to 87%. The deep learning and SVM classifiers seem complementary. For four knowledge types (Concept, Control, Pattern, and Non-Information), SVM clearly outperforms deep learning which, on the other hand, is more accurate for identifying the remaining types. When considering multiple knowledge types at once (i.e., multi-label classification) deep learning outperforms naïve baselines and traditional machine learning achieving a MacroAUC up to 79%. We also compared classifiers using embeddings pre-trained on generic text corpora and StackOverflow but did not observe significant improvements. Finally, to assess the generalizability of the classifiers, we re-tested them on a different, unseen Python documentation dataset. Classifiers for Functionality, Concept, Purpose, Pattern, and Directive seem to generalize from Java and .NET to Python documentation. The accuracy related to the remaining types seems API-specific. We discuss our results and how they inform the development of tools for supporting developers sharing and accessing API knowledge. Published article: https://doi.org/10.1145/3338906.3338943

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper evaluates conventional ML (k-NN, SVM) and deep learning classifiers for identifying 12 knowledge types in API reference documentation. Models are trained on a manually annotated corpus of 5,574 Java and .NET items; performance is reported via per-type AUPRC (up to 87 %) for binary classification and MacroAUC (up to 79 %) for multi-label classification. SVM and DL are shown to be complementary on different types; pre-trained embeddings yield no significant gain; a held-out Python corpus is used to test cross-language generalization, with five types transferring and the rest appearing API-specific.

Significance. If the ground-truth labels are reliable, the work supplies concrete evidence that automated identification of API knowledge types is feasible at useful accuracy levels and that DL and SVM capture complementary signals. The cross-language transfer experiment and the explicit comparison of embedding sources are positive features that strengthen the empirical contribution for tool-building in software engineering.

major comments (2)
  1. [Section 3] Dataset construction / annotation protocol (Section 3): the manuscript provides no inter-annotator agreement statistic (Cohen’s κ, Fleiss’ κ, or equivalent) nor a description of how conflicts among the 12 taxonomy labels were resolved on the 5,574 items. Because every reported AUPRC and MacroAUC value is computed against these labels, the absence of reliability evidence is load-bearing for the central performance claims.
  2. [Results section (multi-label table)] Results, multi-label experiment (Table 4 or equivalent): the claim that deep learning “outperforms naïve baselines and traditional machine learning” reaching MacroAUC 79 % is presented without statistical significance tests or confidence intervals on the difference versus SVM. This weakens the comparative conclusion that is used to motivate tool development.
minor comments (2)
  1. [Abstract and Results] The abstract states “we did not observe significant improvements” from StackOverflow embeddings but supplies no p-values or effect-size numbers; the corresponding results paragraph should include them.
  2. [Methods] Feature extraction details (vectorization, hyper-parameter search, class-imbalance handling) are referenced only at high level; a short methods subsection or appendix table would improve reproducibility.
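
For readers unfamiliar with the agreement statistic requested in the first major comment, the following is a minimal sketch of Cohen's κ for a single knowledge type labelled by two annotators. The annotation vectors are invented for illustration and are not taken from the paper's dataset.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if both annotators labelled independently
    # according to their own label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[c] * count_b[c] for c in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical presence/absence labels for one knowledge type on ten items.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

In a multi-label setting, one option is to compute κ separately for each of the 12 types, since raw agreement on the full label set penalises any single disagreement.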

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.

  1. Referee: [Section 3] Dataset construction / annotation protocol (Section 3): the manuscript provides no inter-annotator agreement statistic (Cohen’s κ, Fleiss’ κ, or equivalent) nor a description of how conflicts among the 12 taxonomy labels were resolved on the 5,574 items. Because every reported AUPRC and MacroAUC value is computed against these labels, the absence of reliability evidence is load-bearing for the central performance claims.

    Authors: We agree that inter-annotator agreement (IAA) statistics strengthen claims about label quality. The 5,574 items were annotated by the first author following the taxonomy validated in prior work, with co-author discussions to resolve ambiguous cases; however, no formal IAA metric was computed at the time. In revision we will expand Section 3 with a detailed annotation protocol description (including conflict resolution via discussion) and explicitly note the absence of IAA as a limitation. Computing full IAA post hoc is not possible without re-annotating a sample, so we treat this as a partial revision. revision: partial

  2. Referee: [Results section (multi-label table)] Results, multi-label experiment (Table 4 or equivalent): the claim that deep learning “outperforms naïve baselines and traditional machine learning” reaching MacroAUC 79 % is presented without statistical significance tests or confidence intervals on the difference versus SVM. This weakens the comparative conclusion that is used to motivate tool development.

    Authors: We agree that statistical support for the DL vs. SVM comparison would strengthen the multi-label results. In the revised manuscript we will add bootstrap confidence intervals around the MacroAUC values and include a paired statistical test (e.g., McNemar’s test on per-document predictions or a bootstrap test on the AUC difference) to evaluate whether the observed advantage of deep learning over SVM is statistically significant. revision: yes
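
The bootstrap comparison promised in the second response could look roughly like the sketch below: documents are resampled with replacement, both classifiers are scored on the same resample, and the percentile interval of the AUC difference is reported. This is an assumption about how such a test might be set up, not the authors' procedure. It uses a single-label ROC AUC for brevity, the scores are invented, and the function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_ci(y_true, scores_a, scores_b, n_resamples=2000,
                        alpha=0.05, seed=0) -> Tuple[float, float]:
    """Percentile CI for AUC(a) - AUC(b), resampling documents in pairs."""
    if len(set(y_true)) < 2:
        raise ValueError("y_true must contain both classes")
    rng = random.Random(seed)
    n = len(y_true)
    diffs: List[float] = []
    while len(diffs) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [y_true[i] for i in idx]
        if len(set(y)) < 2:
            continue  # AUC is undefined on a single-class resample
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(roc_auc(y, a) - roc_auc(y, b))
    diffs.sort()
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return diffs[lo_idx], diffs[hi_idx]


# Invented labels and scores for two classifiers on the same 12 documents.
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1]
scores_a = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.7, 0.5, 0.85, 0.1, 0.35, 0.65]
scores_b = [0.7, 0.4, 0.6, 0.5, 0.3, 0.45, 0.8, 0.2, 0.55, 0.6, 0.1, 0.5]
ci_low, ci_high = paired_bootstrap_ci(y_true, scores_a, scores_b)
print(ci_low, ci_high)
```

An interval that excludes zero would support the claim that one classifier outperforms the other; resampling the same documents for both models keeps the comparison paired.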

Circularity Check

0 steps flagged

No circularity: empirical ML performance on held-out annotations

The paper trains standard classifiers (k-NN, SVM, deep learning) on a manually annotated corpus of 5,574 items and reports direct performance metrics (AUPRC, MacroAUC) on held-out test data plus a separate Python transfer set. No equations, parameter fits presented as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation chain. All headline numbers are computed against external ground-truth labels rather than being forced by the model's own structure or prior author results. This is a standard empirical evaluation whose validity rests on annotation quality, not on any internal reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The evaluation rests on the assumption that the prior grounded taxonomy is valid and that manual annotations are reliable ground truth; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption The 12 knowledge types identified in prior work form a complete taxonomy suitable for supervised classification of API documentation.
    The study builds directly on the taxonomy introduced in previous research without re-deriving or validating its completeness.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 1 internal anchor

  1. [1]

    A field study of API learning obstacles,

    M. P. Robillard and R. DeLine, “A field study of API learning obstacles,” Empirical Software Engineering, vol. 16, no. 6, pp. 703–732, 2010

  2. [2]

    Improving api documentation usability with knowledge pushing,

    U. Dekel and J. D. Herbsleb, “Improving api documentation usability with knowledge pushing,” in Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society, 2009, pp. 320–330

  3. [3]

    Patterns of Knowledge in API Reference Documentation,

    W. Maalej and M. P. Robillard, “Patterns of Knowledge in API Reference Documentation,” IEEE Trans. Softw. Eng., vol. 39, no. 9, pp. 1264–1282, 2013

  4. [4]

    Discovering information explaining api types using text classification,

    G. Petrosyan, M. P. Robillard, and R. De Mori, “Discovering information explaining api types using text classification,” in Proceedings of the 37th International Conference on Software Engineering-Volume 1. IEEE Press, 2015, pp. 869–879

  5. [5]

    Recommending reference API documentation,

    M. P. Robillard and Y. B. Chhetri, “Recommending reference API documentation,” Empirical Software Engineering, vol. 20, no. 6, pp. 1558–1586, Jul. 2014

  6. [6]

    A case study of api redesign for improved usability,

    J. Stylos, B. Graf, D. K. Busse, C. Ziegler, R. Ehret, and J. Karstens, “A case study of api redesign for improved usability,” in Visual Languages and Human-Centric Computing, 2008. VL/HCC 2008. IEEE Symposium on. IEEE, 2008, pp. 189–192

  7. [7]

    The implications of method placement on api learnability,

    J. Stylos and B. A. Myers, “The implications of method placement on api learnability,” in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering. ACM, 2008, pp. 105–112

  8. [8]

    What should developers be aware of? An empirical study on the directives of API documentation,

    M. Monperrus, M. Eichberg, E. Tekes, and M. Mezini, “What should developers be aware of? An empirical study on the directives of API documentation,” Empirical Software Engineering, vol. 17, no. 6, pp. 703–737, 2011

  9. [9]

    An observational study on api usage constraints and their documentation,

    M. A. Saied, H. Sahraoui, and B. Dufour, “An observational study on api usage constraints and their documentation,” in Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on. IEEE, 2015, pp. 33–42

  10. [10]

    Deep learning: methods and applications,

    L. Deng and D. Yu, “Deep learning: methods and applications,” Foundations and Trends® in Signal Processing, vol. 7, no. 3–4, pp. 197–387, 2014

  11. [11]

    Long short-term memory,

    S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997

  12. [12]

    Deep learning,

    Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015

  13. [13]

    Distributed representations of words and phrases and their compositionality,

    T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119

  14. [14]

    Area under the precision-recall curve: Point estimates and confidence intervals,

    K. Boyd, K. H. Eng, and C. D. Page, “Area under the precision-recall curve: Point estimates and confidence intervals,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2013, pp. 451–466

  15. [15]

    Using auc and accuracy in evaluating learning algorithms,

    J. Huang and C. X. Ling, “Using auc and accuracy in evaluating learning algorithms,” IEEE Transactions on knowledge and Data Engineering, vol. 17, no. 3, pp. 299–310, 2005

  16. [16]

    A systematic analysis of performance measures for classification tasks,

    M. Sokolova and G. Lapalme, “A systematic analysis of performance measures for classification tasks,” Information Processing & Management, vol. 45, no. 4, pp. 427–437, 2009

  17. [17]

    Concurrence among imbalanced labels and its influence on multilabel resampling algorithms,

    F. Charte, A. Rivera, M. J. del Jesus, and F. Herrera, “Concurrence among imbalanced labels and its influence on multilabel resampling algorithms,” in International Conference on Hybrid Artificial Intelligence Systems. Springer, 2014, pp. 110–121

  18. [18]

    Multilabel classification,

    F. Herrera, F. Charte, A. J. Rivera, and M. J. Del Jesus, “Multilabel classification,” in Multilabel Classification. Springer, 2016, pp. 17–31

  19. [19]

    Glove: Global vectors for word representation

    J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation.” in EMNLP, vol. 14, 2014, pp. 1532–1543

  20. [20]

    Large-scale learning of word relatedness with constraints,

    G. Halawi, G. Dror, E. Gabrilovich, and Y. Koren, “Large-scale learning of word relatedness with constraints,” in KDD. New York, NY, USA: ACM, 2012, pp. 1406–1414. [Online]. Available: http://doi.acm.org/10.1145/2339530.2339751

  21. [21]

    Text categorization with support vector machines: Learning with many relevant features,

    T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in European conference on machine learning. Springer, 1998, pp. 137–142

  22. [22]

    When is “nearest neighbor” meaningful?

    K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When is “nearest neighbor” meaningful?” in International conference on database theory. Springer, 1999, pp. 217–235

  23. [23]

    Natural language processing to quantify security effort in the software development lifecycle

    C. A. Cois and R. Kazman, “Natural language processing to quantify security effort in the software development lifecycle.” in SEKE, 2015, pp. 716–721

  24. [24]

    On the naturalness of software,

    A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, “On the naturalness of software,” in 2012 34th International Conference on Software Engineering (ICSE). IEEE, 2012, pp. 837–847

  25. [25]

    Training linear svms in linear time,

    T. Joachims, “Training linear svms in linear time,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006, pp. 217–226

  26. [26]

    Sentiment polarity detection for software development,

    F. Calefato, F. Lanubile, F. Maiorano, and N. Novielli, “Sentiment polarity detection for software development,” Empirical Software Engineering, vol. 23, no. 3, pp. 1352–1382, 2018

  27. [27]

    Easy over hard: A case study on deep learning,

    W. Fu and T. Menzies, “Easy over hard: A case study on deep learning,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 2017, pp. 49–60

  28. [28]

    One-against-all multi-class svm classification using reliability measures,

    Y. Liu and Y. F. Zheng, “One-against-all multi-class svm classification using reliability measures,” in Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., vol. 2. IEEE, 2005, pp. 849–854

  29. [29]

    Random search for hyper-parameter optimization,

    J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” Journal of Machine Learning Research, vol. 13, no. Feb, pp. 281–305, 2012

  30. [30]

    Nearest neighbor pattern classification,

    T. M. Cover, P. E. Hart et al., “Nearest neighbor pattern classification,” IEEE transactions on information theory, vol. 13, no. 1, pp. 21–27, 1967

  31. [31]

    Ml-knn: A lazy learning approach to multi-label learning,

    M.-L. Zhang and Z.-H. Zhou, “Ml-knn: A lazy learning approach to multi-label learning,” Pattern recognition, vol. 40, no. 7, pp. 2038–2048, 2007

  32. [32]

    Deep Learning: A Practitioner’s Approach

    J. Patterson and A. Gibson, Deep Learning: A Practitioner’s Approach. O’Reilly Media, 2017

  33. [33]

    Dropout: a simple way to prevent neural networks from overfitting

    N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting.” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014. [Online]. Available: http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf

  34. [34]

    An overview of gradient descent optimization algorithms

    S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016

  35. [35]

    Large-scale multi-label text classification—revisiting neural networks,

    J. Nam, J. Kim, E. L. Mencía, I. Gurevych, and J. Fürnkranz, “Large-scale multi-label text classification—revisiting neural networks,” in Joint european conference on machine learning and knowledge discovery in databases. Springer, 2014, pp. 437–452

  36. [36]

    Contextual correlates of semantic similarity,

    G. A. Miller and W. G. Charles, “Contextual correlates of semantic similarity,” Language and cognitive processes, vol. 6, no. 1, pp. 1–28, 1991

  37. [37]

    Knowledge-based approaches in software documentation: A systematic literature review,

    W. Ding, P. Liang, A. Tang, and H. Van Vliet, “Knowledge-based approaches in software documentation: A systematic literature review,” Information and Software Technology, vol. 56, no. 6, pp. 545–567, 2014

  38. [38]

    Inferring method specifications from natural language api descriptions,

    R. Pandita, X. Xiao, H. Zhong, T. Xie, S. Oney, and A. Paradkar, “Inferring method specifications from natural language api descriptions,” in Proceedings of the 34th International Conference on Software Engineering. IEEE Press, 2012, pp. 815–825

  39. [39]

    Predicting semantically linkable knowledge in developer online forums via convolutional neural network,

    B. Xu, D. Ye, Z. Xing, X. Xia, G. Chen, and S. Li, “Predicting semantically linkable knowledge in developer online forums via convolutional neural network,” in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ACM, 2016, pp. 51–62

  40. [40]

    Keep it simple: Is deep learning good for linguistic smell detection?

    S. Fakhoury, V. Arnaoudova, C. Noiseux, F. Khomh, and G. Antoniol, “Keep it simple: Is deep learning good for linguistic smell detection?” in 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2018, pp. 602–611

  41. [41]

    Natural language or not (nlon)-a package for software engineering text analysis pipeline,

    M. Mäntylä, F. Calefato, and M. Claes, “Natural language or not (nlon)-a package for software engineering text analysis pipeline,” in 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR). IEEE, 2018, pp. 387–391

  42. [42]

    The psychological meaning of words: Liwc and computerized text analysis methods,

    Y. R. Tausczik and J. W. Pennebaker, “The psychological meaning of words: Liwc and computerized text analysis methods,” Journal of language and social psychology, vol. 29, no. 1, pp. 24–54, 2010

  43. [43]

    On user rationale in software engineering,

    Z. Kurtanović and W. Maalej, “On user rationale in software engineering,” Requirements Engineering, vol. 23, no. 3, pp. 357–379, 2018

  44. [44]

    Exploring techniques for rationale extraction from existing documents,

    B. Rogers, J. Gung, Y. Qiao, and J. E. Burge, “Exploring techniques for rationale extraction from existing documents,” in 2012 34th international conference on software engineering (ICSE). IEEE, 2012, pp. 1313–1316

  45. [45]

    Replicated softmax: an undirected topic model,

    G. E. Hinton and R. R. Salakhutdinov, “Replicated softmax: an undirected topic model,” in Advances in neural information processing systems, 2009, pp. 1607–1614

  46. [46]

    Generalized cross entropy loss for training deep neural networks with noisy labels,

    Z. Zhang and M. Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,” in Advances in Neural Information Processing Systems, 2018, pp. 8792–8802

  47. [47]

    Variants of rmsprop and adagrad with logarithmic regret bounds,

    M. C. Mukkamala and M. Hein, “Variants of rmsprop and adagrad with logarithmic regret bounds,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017, pp. 2545–2553

  48. [48]

    Creating and evolving developer documentation: understanding the decisions of open source contributors,

    B. Dagenais and M. P. Robillard, “Creating and evolving developer documentation: understanding the decisions of open source contributors,” in Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering. ACM, 2010, pp. 127–136