On Using Machine Learning to Identify Knowledge in API Reference Documentation

Alireza Mollaalizadehbahnemiri; Davide Fucci; Walid Maalej

arxiv: 1907.09807 · v1 · pith:KPEFQFBNnew · submitted 2019-07-23 · 💻 cs.SE

Pith reviewed 2026-05-24 17:20 UTC · model grok-4.3

classification 💻 cs.SE

keywords: API documentation · machine learning · text classification · knowledge types · deep learning · multi-label classification · software engineering

The pith

Machine learning can automatically identify specific knowledge types in API reference documentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether modern text classification methods can detect which of 12 knowledge types appear in API reference documentation. It trains models on a set of 5,574 manually labeled Java and .NET examples and measures performance both for single types and for combinations of types. Deep learning reaches the highest scores in the multi-label setting while support vector machines perform better on some individual types. Several of the resulting classifiers also work on an unseen Python documentation set. The work explores how such classification could help build tools that let developers find needed information more quickly in dense reference material.

Core claim

The authors establish that conventional machine learning and deep learning classifiers can detect the presence of particular knowledge types from a grounded taxonomy within API reference documentation. When each type is classified separately the best area under the precision-recall curve reaches 87 percent. In the multi-label setting deep learning achieves a macro area under the curve of 79 percent and outperforms both naive baselines and traditional methods. Five of the classifiers generalize from the Java and .NET training data to Python documentation without retraining.
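
The two headline metrics can be illustrated with a small, dependency-free sketch. It assumes that AUPRC is computed as average precision and that MacroAUC is the unweighted mean of per-label ROC AUC values; the function names and the toy labels and scores are illustrative, not taken from the paper.

```python
from typing import Dict, Sequence, Tuple


def average_precision(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the precision-recall curve, computed as average precision."""
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("need at least one positive example")
    # Rank documents from most to least confident (ties keep input order).
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the rank of this positive
    return total / positives


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes to compute ROC AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(per_label: Dict[str, Tuple[Sequence[int], Sequence[float]]]) -> float:
    """Unweighted mean of the per-label ROC AUC values."""
    return sum(roc_auc(y, s) for y, s in per_label.values()) / len(per_label)


# Toy data (hypothetical): true labels and classifier scores for four documents.
toy = {
    "Functionality": ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
    "Directive": ([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]),
}
for name, (y, s) in toy.items():
    print(name, round(average_precision(y, s), 3), round(roc_auc(y, s), 3))
print("MacroAUC:", macro_auc(toy))
```

Macro averaging gives every knowledge type the same weight, so rare types influence the score as much as frequent ones.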

What carries the argument

A collection of binary and multi-label text classifiers (k-nearest neighbors, support vector machines, and deep learning) trained on annotated API documentation to detect each of 12 knowledge types from a grounded taxonomy.
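
The difference between the binary and the multi-label setting comes down to how the annotations are encoded. The sketch below assumes each document carries a set of knowledge-type tags; the helper names and example sentences are hypothetical, and `LABELS` lists only a subset of type names mentioned in the abstract rather than the full 12-type taxonomy.

```python
from typing import Dict, List, Set, Tuple

# Subset of type names mentioned in the abstract (the full taxonomy has 12).
LABELS = ["Functionality", "Concept", "Directive", "Purpose", "Pattern"]

Document = Tuple[str, Set[str]]  # (text, set of annotated knowledge types)


def to_binary_tasks(docs: List[Document], labels: List[str]) -> Dict[str, List[Tuple[str, int]]]:
    """One independent yes/no dataset per knowledge type."""
    return {name: [(text, int(name in tags)) for text, tags in docs] for name in labels}


def to_indicator_rows(docs: List[Document], labels: List[str]) -> List[List[int]]:
    """One 0/1 vector per document, as consumed by a multi-label classifier."""
    return [[int(name in tags) for name in labels] for _, tags in docs]


# Hypothetical documentation snippets, not items from the annotated corpus.
docs = [
    ("Returns the number of elements in this list.", {"Functionality"}),
    ("Callers must close the stream after use.", {"Directive"}),
    ("A view is a window onto the data; reuse it to avoid copies.", {"Concept", "Pattern"}),
]

binary_tasks = to_binary_tasks(docs, LABELS)
indicator_rows = to_indicator_rows(docs, LABELS)
print(binary_tasks["Directive"])
print(indicator_rows)
```

In the binary view each type gets its own classifier and training set; in the multi-label view a single model predicts the whole indicator vector at once.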

If this is right

Tools that automatically tag or surface documentation sections by knowledge type become feasible.
Hybrid models that combine support vector machines and deep learning can be built to cover all knowledge types more evenly.
Classifiers for Functionality, Concept, Purpose, Pattern, and Directive can be reused across programming languages.
Pre-trained embeddings from generic or StackOverflow corpora do not yield measurable gains for this task.
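
One simple way to realize the hybrid idea is to route each knowledge type to whichever model scored best on validation data. The following is a minimal sketch under that assumption; the function name and all scores are made up for illustration and are not results from the paper.

```python
from typing import Dict


def pick_model_per_type(auprc: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Map each knowledge type to the model with the highest validation AUPRC."""
    return {ktype: max(by_model, key=by_model.get) for ktype, by_model in auprc.items()}


# Hypothetical validation scores, NOT numbers reported in the paper.
validation_auprc = {
    "Concept": {"svm": 0.60, "deep_learning": 0.50},
    "Pattern": {"svm": 0.40, "deep_learning": 0.30},
    "Functionality": {"svm": 0.70, "deep_learning": 0.80},
}

router = pick_model_per_type(validation_auprc)
print(router)
```

The selection should be made on validation data that is separate from the final test set, otherwise the hybrid's reported score would be optimistically biased.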

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same classifiers could be embedded inside integrated development environments to highlight documentation relevant to the current coding task.
Classification errors on the existing data could be used to refine or extend the original 12-type taxonomy.
The approach could be applied to other software texts such as tutorials, forum posts, or commit messages.
Knowledge types that generalize across languages may reflect universal API concepts while others are tied to particular language ecosystems.

Load-bearing premise

The 5,574 manually annotated Java and .NET documentation items supply accurate ground-truth labels that represent the full range of knowledge types and apply to other languages and APIs.

What would settle it

An independently annotated dataset of API documentation from a third language that shows whether the reported accuracies remain stable or drop sharply.

Figures

Figures reproduced from arXiv: 1907.09807 by Alireza Mollaalizadehbahnemiri, Davide Fucci, Walid Maalej.

**Figure 2.** Knowledge types distribution in the CADO dataset after resampling.

**Figure 3.** Knowledge types distribution in the PYTHON dataset. Two Ph.D. students in software engineering, accustomed to working with Python, manually labelled the knowledge types in each document. For this task, we provided them the same guidelines from Maalej and Robillard [3] with small adaptations, such as providing examples using the Python programming language. The agreement on the label set was 14%, i.e., 14 out of t…

**Figure 4.** Architecture of the RNN used for classification of the knowledge types.

**Figure 5.** A single LSTM recurrent module containing input (…

Original abstract

Using API reference documentation like JavaDoc is an integral part of software development. Previous research introduced a grounded taxonomy that organizes API documentation knowledge in 12 types, including knowledge about the Functionality, Structure, and Quality of an API. We study how well modern text classification approaches can automatically identify documentation containing specific knowledge types. We compared conventional machine learning (k-NN and SVM) and deep learning approaches trained on manually annotated Java and .NET API documentation (n = 5,574). When classifying the knowledge types individually (i.e., multiple binary classifiers) the best AUPRC was up to 87%. The deep learning and SVM classifiers seem complementary. For four knowledge types (Concept, Control, Pattern, and Non-Information), SVM clearly outperforms deep learning which, on the other hand, is more accurate for identifying the remaining types. When considering multiple knowledge types at once (i.e., multi-label classification) deep learning outperforms naïve baselines and traditional machine learning achieving a MacroAUC up to 79%. We also compared classifiers using embeddings pre-trained on generic text corpora and StackOverflow but did not observe significant improvements. Finally, to assess the generalizability of the classifiers, we re-tested them on a different, unseen Python documentation dataset. Classifiers for Functionality, Concept, Purpose, Pattern, and Directive seem to generalize from Java and .NET to Python documentation. The accuracy related to the remaining types seems API-specific. We discuss our results and how they inform the development of tools for supporting developers sharing and accessing API knowledge. Published article: https://doi.org/10.1145/3338906.3338943

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows DL and SVM are complementary on the 12-type API taxonomy with some Python transfer, but the unquantified label quality is the main limit on what the numbers mean.

The main thing to know is that this work takes an existing 12-type taxonomy for API reference docs and runs standard binary and multi-label classifiers on 5,574 manually labeled Java and .NET items, reaching AUPRC up to 87% and MacroAUC up to 79%. It adds a complementarity check between SVM and deep learning plus a cross-language test on Python docs that earlier papers on the taxonomy did not include. Some types (Functionality, Concept, Purpose, Pattern, Directive) appear to transfer; others look language-specific. Pre-trained embeddings, whether from generic text corpora or from StackOverflow, did not bring significant improvements.

Referee Report

2 major / 2 minor

Summary. The paper evaluates conventional ML (k-NN, SVM) and deep learning classifiers for identifying 12 knowledge types in API reference documentation. Models are trained on a manually annotated corpus of 5,574 Java and .NET items; performance is reported via per-type AUPRC (up to 87 %) for binary classification and MacroAUC (up to 79 %) for multi-label classification. SVM and DL are shown to be complementary on different types; pre-trained embeddings yield no significant gain; a held-out Python corpus is used to test cross-language generalization, with five types transferring and the rest appearing API-specific.

Significance. If the ground-truth labels are reliable, the work supplies concrete evidence that automated identification of API knowledge types is feasible at useful accuracy levels and that DL and SVM capture complementary signals. The cross-language transfer experiment and the explicit comparison of embedding sources are positive features that strengthen the empirical contribution for tool-building in software engineering.

major comments (2)

[Section 3] Dataset construction / annotation protocol (Section 3): the manuscript provides no inter-annotator agreement statistic (Cohen’s κ, Fleiss’ κ, or equivalent) nor a description of how conflicts among the 12 taxonomy labels were resolved on the 5,574 items. Because every reported AUPRC and MacroAUC value is computed against these labels, the absence of reliability evidence is load-bearing for the central performance claims.
[Results section (multi-label table)] Results, multi-label experiment (Table 4 or equivalent): the claim that deep learning “outperforms naïve baselines and traditional machine learning” reaching MacroAUC 79 % is presented without statistical significance tests or confidence intervals on the difference versus SVM. This weakens the comparative conclusion that is used to motivate tool development.
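
For readers unfamiliar with the agreement statistic requested in the first major comment, here is a minimal sketch of Cohen's κ for one knowledge type and two annotators. The function name and the toy annotations are hypothetical; nothing here reproduces the paper's annotation data.

```python
from typing import Sequence


def cohens_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    a, b = list(rater_a), list(rater_b)
    # Observed agreement: share of items on which both raters chose the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if each rater labeled at random with their own label rates.
    categories = set(a) | set(b)
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations: 1 = "type present", 0 = "type absent" for ten documents.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
print(round(cohens_kappa(annotator_1, annotator_2), 3))
```

Because the taxonomy is multi-label, a statistic like this would typically be reported per knowledge type, or replaced by a multi-rater or multi-label alternative.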

minor comments (2)

[Abstract and Results] The abstract states “we did not observe significant improvements” from StackOverflow embeddings but supplies no p-values or effect-size numbers; the corresponding results paragraph should include them.
[Methods] Feature extraction details (vectorization, hyper-parameter search, class-imbalance handling) are referenced only at high level; a short methods subsection or appendix table would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.

Referee: [Section 3] Dataset construction / annotation protocol (Section 3): the manuscript provides no inter-annotator agreement statistic (Cohen’s κ, Fleiss’ κ, or equivalent) nor a description of how conflicts among the 12 taxonomy labels were resolved on the 5,574 items. Because every reported AUPRC and MacroAUC value is computed against these labels, the absence of reliability evidence is load-bearing for the central performance claims.

Authors: We agree that inter-annotator agreement (IAA) statistics strengthen claims about label quality. The 5,574 items were annotated by the first author following the taxonomy validated in prior work, with co-author discussions to resolve ambiguous cases; however, no formal IAA metric was computed at the time. In revision we will expand Section 3 with a detailed annotation protocol description (including conflict resolution via discussion) and explicitly note the absence of IAA as a limitation. Computing full IAA post hoc is not possible without re-annotating a sample, so we treat this as a partial revision.
revision: partial

Referee: [Results section (multi-label table)] Results, multi-label experiment (Table 4 or equivalent): the claim that deep learning “outperforms naïve baselines and traditional machine learning” reaching MacroAUC 79 % is presented without statistical significance tests or confidence intervals on the difference versus SVM. This weakens the comparative conclusion that is used to motivate tool development.

Authors: We agree that statistical support for the DL vs. SVM comparison would strengthen the multi-label results. In the revised manuscript we will add bootstrap confidence intervals around the MacroAUC values and include a paired statistical test (e.g., McNemar’s test on per-document predictions or a bootstrap test on the AUC difference) to evaluate whether the observed advantage of deep learning over SVM is statistically significant.
revision: yes
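
A paired bootstrap on the MacroAUC difference, one of the options named in the response, could look like the sketch below. It is one possible design, not the authors' procedure: it assumes MacroAUC is the mean of per-label ROC AUC values, resamples documents with replacement, and scores both models on the same resample. All names, defaults, and data are illustrative.

```python
import random
from typing import List, Tuple


def roc_auc(y_true: List[int], scores: List[float]) -> float:
    """Probability that a random positive is scored above a random negative."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes to compute ROC AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_auc(Y: List[List[int]], S: List[List[float]]) -> float:
    """Rows are documents, columns are labels; average AUC over the columns."""
    n_labels = len(Y[0])
    aucs = [roc_auc([r[j] for r in Y], [r[j] for r in S]) for j in range(n_labels)]
    return sum(aucs) / n_labels


def bootstrap_macro_auc_diff(Y, S_a, S_b, n_boot=200, alpha=0.05, seed=0) -> Tuple[float, float]:
    """Percentile confidence interval for MacroAUC(model A) - MacroAUC(model B)."""
    rng = random.Random(seed)
    n = len(Y)
    diffs: List[float] = []
    attempts = 0
    while len(diffs) < n_boot and attempts < 20 * n_boot:
        attempts += 1
        idx = [rng.randrange(n) for _ in range(n)]  # same resample for both models
        Yb = [Y[i] for i in idx]
        try:
            diff = macro_auc(Yb, [S_a[i] for i in idx]) - macro_auc(Yb, [S_b[i] for i in idx])
        except ValueError:
            continue  # a label lost one of its classes in this resample
        diffs.append(diff)
    if not diffs:
        raise ValueError("no valid bootstrap resample could be drawn")
    diffs.sort()
    lower = diffs[int((alpha / 2) * len(diffs))]
    upper = diffs[min(len(diffs) - 1, int((1 - alpha / 2) * len(diffs)))]
    return lower, upper


# Toy data (hypothetical): 20 documents, 2 labels; model A is perfect, B is inverted.
Y = [[i % 2, (i // 2) % 2] for i in range(20)]
S_a = [[float(v) for v in row] for row in Y]
S_b = [[1.0 - v for v in row] for row in Y]
print(bootstrap_macro_auc_diff(Y, S_a, S_b))
```

If the resulting interval excludes zero, the observed difference is unlikely to be an artifact of which documents happened to land in the test set.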

Circularity Check

0 steps flagged

No circularity: empirical ML performance on held-out annotations

The paper trains standard classifiers (k-NN, SVM, deep learning) on a manually annotated corpus of 5,574 items and reports direct performance metrics (AUPRC, MacroAUC) on held-out test data plus a separate Python transfer set. No equations, parameter fits presented as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation chain. All headline numbers are computed against external ground-truth labels rather than being forced by the model's own structure or prior author results. This is a standard empirical evaluation whose validity rests on annotation quality, not on any internal reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The evaluation rests on the assumption that the prior grounded taxonomy is valid and that manual annotations are reliable ground truth; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption: The 12 knowledge types identified in prior work form a complete taxonomy suitable for supervised classification of API documentation.
The study builds directly on the taxonomy introduced in previous research without re-deriving or validating its completeness.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 1 internal anchor

[1] M. P. Robillard and R. DeLine, “A field study of API learning obstacles,” Empirical Software Engineering, vol. 16, no. 6, pp. 703–732, 2010.

[2] U. Dekel and J. D. Herbsleb, “Improving api documentation usability with knowledge pushing,” in Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society, 2009, pp. 320–330.

[3] W. Maalej and M. P. Robillard, “Patterns of Knowledge in API Reference Documentation,” IEEE Trans. Softw. Eng., vol. 39, no. 9, pp. 1264–1282, 2013.

[4] G. Petrosyan, M. P. Robillard, and R. De Mori, “Discovering information explaining api types using text classification,” in Proceedings of the 37th International Conference on Software Engineering-Volume 1. IEEE Press, 2015, pp. 869–879.

[5] M. P. Robillard and Y. B. Chhetri, “Recommending reference API documentation,” Empirical Software Engineering, vol. 20, no. 6, pp. 1558–1586, Jul. 2014.

[6] J. Stylos, B. Graf, D. K. Busse, C. Ziegler, R. Ehret, and J. Karstens, “A case study of api redesign for improved usability,” in Visual Languages and Human-Centric Computing, 2008. VL/HCC 2008. IEEE Symposium on. IEEE, 2008, pp. 189–192.

[7] J. Stylos and B. A. Myers, “The implications of method placement on api learnability,” in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering. ACM, 2008, pp. 105–112.

[8] M. Monperrus, M. Eichberg, E. Tekes, and M. Mezini, “What should developers be aware of? An empirical study on the directives of API documentation,” Empirical Software Engineering, vol. 17, no. 6, pp. 703–737, 2011.

[9] M. A. Saied, H. Sahraoui, and B. Dufour, “An observational study on api usage constraints and their documentation,” in Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on. IEEE, 2015, pp. 33–42.

[10] L. Deng and D. Yu, “Deep learning: methods and applications,” Foundations and Trends® in Signal Processing, vol. 7, no. 3–4, pp. 197–387, 2014.

[11] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[12] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.

[13] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.

[14] K. Boyd, K. H. Eng, and C. D. Page, “Area under the precision-recall curve: Point estimates and confidence intervals,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2013, pp. 451–466.

[15] J. Huang and C. X. Ling, “Using auc and accuracy in evaluating learning algorithms,” IEEE Transactions on knowledge and Data Engineering, vol. 17, no. 3, pp. 299–310, 2005.

[16] M. Sokolova and G. Lapalme, “A systematic analysis of performance measures for classification tasks,” Information Processing & Management, vol. 45, no. 4, pp. 427–437, 2009.

[17] F. Charte, A. Rivera, M. J. del Jesus, and F. Herrera, “Concurrence among imbalanced labels and its influence on multilabel resampling algorithms,” in International Conference on Hybrid Artificial Intelligence Systems. Springer, 2014, pp. 110–121.

[18] F. Herrera, F. Charte, A. J. Rivera, and M. J. Del Jesus, “Multilabel classification,” in Multilabel Classification. Springer, 2016, pp. 17–31.

[19] J. Pennington, R. Socher, and C. D. Manning, “Glove: Global vectors for word representation.” in EMNLP, vol. 14, 2014, pp. 1532–1543.

[20] G. Halawi, G. Dror, E. Gabrilovich, and Y. Koren, “Large-scale learning of word relatedness with constraints,” in KDD. New York, NY, USA: ACM, 2012, pp. 1406–1414. [Online]. Available: http://doi.acm.org/10.1145/2339530.2339751

[21] T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in European conference on machine learning. Springer, 1998, pp. 137–142.

[22] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When is “nearest neighbor” meaningful?” in International conference on database theory. Springer, 1999, pp. 217–235.

[23] C. A. Cois and R. Kazman, “Natural language processing to quantify security effort in the software development lifecycle.” in SEKE, 2015, pp. 716–721.

[24] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, “On the naturalness of software,” in 2012 34th International Conference on Software Engineering (ICSE). IEEE, 2012, pp. 837–847.

[25] T. Joachims, “Training linear svms in linear time,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006, pp. 217–226.

[26] F. Calefato, F. Lanubile, F. Maiorano, and N. Novielli, “Sentiment polarity detection for software development,” Empirical Software Engineering, vol. 23, no. 3, pp. 1352–1382, 2018.

[27] W. Fu and T. Menzies, “Easy over hard: A case study on deep learning,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 2017, pp. 49–60.

[28] Y. Liu and Y. F. Zheng, “One-against-all multi-class svm classification using reliability measures,” in Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., vol. 2. IEEE, 2005, pp. 849–854.

[29] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” Journal of Machine Learning Research, vol. 13, no. Feb, pp. 281–305, 2012.

[30] T. M. Cover, P. E. Hart et al., “Nearest neighbor pattern classification,” IEEE transactions on information theory, vol. 13, no. 1, pp. 21–27, 1967.

[31] M.-L. Zhang and Z.-H. Zhou, “Ml-knn: A lazy learning approach to multi-label learning,” Pattern recognition, vol. 40, no. 7, pp. 2038–2048, 2007.

[32] J. Patterson and A. Gibson, Deep Learning: A Practitioner’s Approach. O’Reilly Media, 2017.

[33] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting.” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014. [Online]. Available: http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf

[34] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.

[35] J. Nam, J. Kim, E. L. Mencía, I. Gurevych, and J. Fürnkranz, “Large-scale multi-label text classification—revisiting neural networks,” in Joint european conference on machine learning and knowledge discovery in databases. Springer, 2014, pp. 437–452.

[36] G. A. Miller and W. G. Charles, “Contextual correlates of semantic similarity,” Language and cognitive processes, vol. 6, no. 1, pp. 1–28, 1991.

[37] W. Ding, P. Liang, A. Tang, and H. Van Vliet, “Knowledge-based approaches in software documentation: A systematic literature review,” Information and Software Technology, vol. 56, no. 6, pp. 545–567, 2014.

[38] R. Pandita, X. Xiao, H. Zhong, T. Xie, S. Oney, and A. Paradkar, “Inferring method specifications from natural language api descriptions,” in Proceedings of the 34th International Conference on Software Engineering. IEEE Press, 2012, pp. 815–825.

[39] B. Xu, D. Ye, Z. Xing, X. Xia, G. Chen, and S. Li, “Predicting semantically linkable knowledge in developer online forums via convolutional neural network,” in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ACM, 2016, pp. 51–62.

[40] S. Fakhoury, V. Arnaoudova, C. Noiseux, F. Khomh, and G. Antoniol, “Keep it simple: Is deep learning good for linguistic smell detection?” in 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2018, pp. 602–611.

[41] M. Mäntylä, F. Calefato, and M. Claes, “Natural language or not (nlon)-a package for software engineering text analysis pipeline,” in 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR). IEEE, 2018, pp. 387–391.

[42] Y. R. Tausczik and J. W. Pennebaker, “The psychological meaning of words: Liwc and computerized text analysis methods,” Journal of language and social psychology, vol. 29, no. 1, pp. 24–54, 2010.

[43] Z. Kurtanović and W. Maalej, “On user rationale in software engineering,” Requirements Engineering, vol. 23, no. 3, pp. 357–379, 2018.

[44] B. Rogers, J. Gung, Y. Qiao, and J. E. Burge, “Exploring techniques for rationale extraction from existing documents,” in 2012 34th international conference on software engineering (ICSE). IEEE, 2012, pp. 1313–1316.

[45] G. E. Hinton and R. R. Salakhutdinov, “Replicated softmax: an undirected topic model,” in Advances in neural information processing systems, 2009, pp. 1607–1614.

[46] Z. Zhang and M. Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,” in Advances in Neural Information Processing Systems, 2018, pp. 8792–8802.

[47] M. C. Mukkamala and M. Hein, “Variants of rmsprop and adagrad with logarithmic regret bounds,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017, pp. 2545–2553.

[48] B. Dagenais and M. P. Robillard, “Creating and evolving developer documentation: understanding the decisions of open source contributors,” in Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering. ACM, 2010, pp. 127–136.

[1] [1]

A ﬁeld study of API learning obstacles,

M. P. Robillard and R. DeLine, “A ﬁeld study of API learning obstacles,” Empirical Software Engineering , vol. 16, no. 6, pp. 703–732, 2010

work page 2010

[2] [2]

Improving api documentation usability with knowledge pushing,

U. Dekel and J. D. Herbsleb, “Improving api documentation usability with knowledge pushing,” in Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society, 2009, pp. 320–330

work page 2009

[3] [3]

Patterns of Knowledge in API Reference Documentation,

W. Maalej and M. P. Robillard, “Patterns of Knowledge in API Reference Documentation,” IEEE Trans. Softw. Eng., vol. 39, no. 9, pp. 1264–1282, 2013

work page 2013

[4] [4]

Discovering information explaining api types using text classi- ﬁcation,

G. Petrosyan, M. P. Robillard, and R. De Mori, “Discovering information explaining api types using text classi- ﬁcation,” in Proceedings of the 37th International Conference on Software Engineering-Volume 1. IEEE Press, 2015, pp. 869–879

work page 2015

[5] [5]

Recommending reference API documentation,

M. P. Robillard and Y . B. Chhetri, “Recommending reference API documentation,” Empirical Software Engi- neering, vol. 20, no. 6, pp. 1558–1586, Jul. 2014. 14 On Using Machine Learning to Identify Knowledge in API Reference Documentation A PREPRINT

work page 2014

[6] [6]

A case study of api redesign for improved usability,

J. Stylos, B. Graf, D. K. Busse, C. Ziegler, R. Ehret, and J. Karstens, “A case study of api redesign for improved usability,” in Visual Languages and Human-Centric Computing, 2008. VL/HCC 2008. IEEE Symposium on . IEEE, 2008, pp. 189–192

work page 2008

[7] [7]

The implications of method placement on api learnability,

J. Stylos and B. A. Myers, “The implications of method placement on api learnability,” inProceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering. ACM, 2008, pp. 105–112

work page 2008

[8] [8]

What should developers be aware of? An empirical study on the directives of API documentation,

M. Monperrus, M. Eichberg, E. Tekes, and M. Mezini, “What should developers be aware of? An empirical study on the directives of API documentation,” Empirical Software Engineering , vol. 17, no. 6, pp. 703–737, 2011

work page 2011

[9] [9]

An observational study on api usage constraints and their documenta- tion,

M. A. Saied, H. Sahraoui, and B. Dufour, “An observational study on api usage constraints and their documenta- tion,” in Software Analysis, Evolution and Reengineering (SANER), 2015 IEEE 22nd International Conference on. IEEE, 2015, pp. 33–42

work page 2015

[10] [10]

Deep learning: methods and applications,

L. Deng and D. Yu, “Deep learning: methods and applications,” Foundations and Trends® in Signal Processing, vol. 7, no. 3–4, pp. 197–387, 2014

work page 2014

[11] [11]

Long short-term memory,

S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997

work page 1997

[12] [12]

Deep learning,

Y . LeCun, Y . Bengio, and G. Hinton, “Deep learning,”Nature, vol. 521, no. 7553, p. 436, 2015

work page 2015

[13] [13]

Distributed representations of words and phrases and their compositionality,

T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119

work page 2013

[14] [14]

Area under the precision-recall curve: Point estimates and conﬁdence in- tervals,

K. Boyd, K. H. Eng, and C. D. Page, “Area under the precision-recall curve: Point estimates and conﬁdence in- tervals,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2013, pp. 451–466

work page 2013

[15] [15]

Using auc and accuracy in evaluating learning algorithms,

J. Huang and C. X. Ling, “Using auc and accuracy in evaluating learning algorithms,” IEEE Transactions on knowledge and Data Engineering, vol. 17, no. 3, pp. 299–310, 2005

work page 2005

[16] [16]

A systematic analysis of performance measures for classiﬁcation tasks,

M. Sokolova and G. Lapalme, “A systematic analysis of performance measures for classiﬁcation tasks,” Infor- mation Processing & Management, vol. 45, no. 4, pp. 427–437, 2009

work page 2009

[17] F. Charte, A. Rivera, M. J. del Jesus, and F. Herrera, “Concurrence among imbalanced labels and its influence on multilabel resampling algorithms,” in International Conference on Hybrid Artificial Intelligence Systems. Springer, 2014, pp. 110–121

[18] F. Herrera, F. Charte, A. J. Rivera, and M. J. Del Jesus, “Multilabel classification,” in Multilabel Classification. Springer, 2016, pp. 17–31

[19] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in EMNLP, vol. 14, 2014, pp. 1532–1543

[20] G. Halawi, G. Dror, E. Gabrilovich, and Y. Koren, “Large-scale learning of word relatedness with constraints,” in KDD. New York, NY, USA: ACM, 2012, pp. 1406–1414. [Online]. Available: http://doi.acm.org/10.1145/2339530.2339751

[21] T. Joachims, “Text categorization with support vector machines: Learning with many relevant features,” in European conference on machine learning. Springer, 1998, pp. 137–142

[22] K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, “When is “nearest neighbor” meaningful?” in International conference on database theory. Springer, 1999, pp. 217–235

[23] C. A. Cois and R. Kazman, “Natural language processing to quantify security effort in the software development lifecycle,” in SEKE, 2015, pp. 716–721

[24] A. Hindle, E. T. Barr, Z. Su, M. Gabel, and P. Devanbu, “On the naturalness of software,” in 2012 34th International Conference on Software Engineering (ICSE). IEEE, 2012, pp. 837–847

[25] T. Joachims, “Training linear SVMs in linear time,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2006, pp. 217–226

[26] F. Calefato, F. Lanubile, F. Maiorano, and N. Novielli, “Sentiment polarity detection for software development,” Empirical Software Engineering, vol. 23, no. 3, pp. 1352–1382, 2018

[27] W. Fu and T. Menzies, “Easy over hard: A case study on deep learning,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering. ACM, 2017, pp. 49–60

[28] Y. Liu and Y. F. Zheng, “One-against-all multi-class SVM classification using reliability measures,” in Proceedings. 2005 IEEE International Joint Conference on Neural Networks, 2005., vol. 2. IEEE, 2005, pp. 849–854

[29] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” Journal of Machine Learning Research, vol. 13, no. Feb, pp. 281–305, 2012

[30] T. M. Cover, P. E. Hart et al., “Nearest neighbor pattern classification,” IEEE transactions on information theory, vol. 13, no. 1, pp. 21–27, 1967

[31] M.-L. Zhang and Z.-H. Zhou, “ML-KNN: A lazy learning approach to multi-label learning,” Pattern recognition, vol. 40, no. 7, pp. 2038–2048, 2007

[32] J. Patterson and A. Gibson, Deep Learning: A Practitioner’s Approach. O’Reilly Media, 2017

[33] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014. [Online]. Available: http://www.cs.toronto.edu/~rsalakhu/papers/srivastava14a.pdf

[34] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016

[35] J. Nam, J. Kim, E. L. Mencía, I. Gurevych, and J. Fürnkranz, “Large-scale multi-label text classification—revisiting neural networks,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2014, pp. 437–452

[36] G. A. Miller and W. G. Charles, “Contextual correlates of semantic similarity,” Language and cognitive processes, vol. 6, no. 1, pp. 1–28, 1991

[37] W. Ding, P. Liang, A. Tang, and H. Van Vliet, “Knowledge-based approaches in software documentation: A systematic literature review,” Information and Software Technology, vol. 56, no. 6, pp. 545–567, 2014

[38] R. Pandita, X. Xiao, H. Zhong, T. Xie, S. Oney, and A. Paradkar, “Inferring method specifications from natural language API descriptions,” in Proceedings of the 34th International Conference on Software Engineering. IEEE Press, 2012, pp. 815–825

[39] B. Xu, D. Ye, Z. Xing, X. Xia, G. Chen, and S. Li, “Predicting semantically linkable knowledge in developer online forums via convolutional neural network,” in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ACM, 2016, pp. 51–62

[40] S. Fakhoury, V. Arnaoudova, C. Noiseux, F. Khomh, and G. Antoniol, “Keep it simple: Is deep learning good for linguistic smell detection?” in 2018 IEEE 25th International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2018, pp. 602–611

[41] M. Mäntylä, F. Calefato, and M. Claes, “Natural language or not (NLoN) - a package for software engineering text analysis pipeline,” in 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR). IEEE, 2018, pp. 387–391

[42] Y. R. Tausczik and J. W. Pennebaker, “The psychological meaning of words: LIWC and computerized text analysis methods,” Journal of language and social psychology, vol. 29, no. 1, pp. 24–54, 2010

[43] Z. Kurtanović and W. Maalej, “On user rationale in software engineering,” Requirements Engineering, vol. 23, no. 3, pp. 357–379, 2018

[44] B. Rogers, J. Gung, Y. Qiao, and J. E. Burge, “Exploring techniques for rationale extraction from existing documents,” in 2012 34th international conference on software engineering (ICSE). IEEE, 2012, pp. 1313–1316

[45] G. E. Hinton and R. R. Salakhutdinov, “Replicated softmax: an undirected topic model,” in Advances in neural information processing systems, 2009, pp. 1607–1614

[46] Z. Zhang and M. Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,” in Advances in Neural Information Processing Systems, 2018, pp. 8792–8802

[47] M. C. Mukkamala and M. Hein, “Variants of RMSProp and Adagrad with logarithmic regret bounds,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, 2017, pp. 2545–2553

[48] B. Dagenais and M. P. Robillard, “Creating and evolving developer documentation: understanding the decisions of open source contributors,” in Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering. ACM, 2010, pp. 127–136