REStack: A Large-Scale Dataset of Reverse Engineering Discussions from Stack Exchange

Farha Kamal; Md Humaun Kabir; Md Rakibul Islam

arxiv: 2606.05493 · v1 · pith:WZZNOJ3Xnew · submitted 2026-06-03 · 💻 cs.SE

Md Humaun Kabir, Md Rakibul Islam, Farha Kamal

Pith reviewed 2026-06-28 04:51 UTC · model grok-4.3

classification 💻 cs.SE

keywords: reverse engineering, stack exchange, dataset, topic modeling, software engineering, empirical analysis, cybersecurity


The pith

The REStack dataset shows that reverse engineering discussions focus on practical debugging and decompilation, while memory and firmware topics remain difficult to resolve.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper gathers more than 12,000 reverse engineering posts from Stack Overflow and the Reverse Engineering Stack Exchange site, spanning more than 15 years. It applies topic modeling to sort the posts into 23 topics grouped under six themes and adds difficulty signals such as unanswered rates and response times. The resulting picture is that most discussion stays task-oriented around debugging, decompilation, and system analysis, yet memory, firmware, and file-format topics show notably higher rates of unresolved questions. The authors release the full dataset and scripts so others can run empirical studies, build teaching materials, or train assistance tools on real RE questions.

Core claim

A collection of over 12,000 RE posts can be reduced to 23 coherent topics that demonstrate RE practice is overwhelmingly practical and task-oriented, with debugging, decompilation, and system-level analysis dominating, while memory, firmware, and file-format analysis show elevated difficulty and unresolved rates.

What carries the argument

The REStack dataset, assembled by collecting posts from two Stack Exchange sites and then processed with LDA topic modeling whose hyperparameters were tuned by genetic algorithm, followed by manual labeling into six thematic categories and enrichment with community-derived difficulty metadata.
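To make the tuning step concrete, here is a minimal sketch of a genetic search over LDA-style hyperparameters. It is not the authors' implementation: the search bounds, population size, mutation scheme, and the `toy_fitness` function are all assumptions made for illustration. In a real run the fitness function would train an LDA model with the candidate settings and return a quality score; the toy version only stands in for that call so the loop can be executed.

```python
import random
from typing import Callable, Dict, List, Tuple

Genome = Dict[str, float]

# Hypothetical search space: topic count plus the two Dirichlet priors of LDA.
BOUNDS: Dict[str, Tuple[float, float]] = {
    "num_topics": (5, 50),
    "alpha": (0.01, 1.0),
    "beta": (0.01, 1.0),
}


def random_genome(rng: random.Random) -> Genome:
    """Draw one candidate uniformly inside the bounds."""
    genome = {name: rng.uniform(lo, hi) for name, (lo, hi) in BOUNDS.items()}
    genome["num_topics"] = round(genome["num_topics"])
    return genome


def crossover(a: Genome, b: Genome, rng: random.Random) -> Genome:
    """Uniform crossover: each gene comes from one parent at random."""
    return {name: (a[name] if rng.random() < 0.5 else b[name]) for name in BOUNDS}


def mutate(genome: Genome, rng: random.Random, rate: float = 0.2) -> Genome:
    """Perturb each gene with probability `rate`, clamped to its bounds."""
    child = dict(genome)
    for name, (lo, hi) in BOUNDS.items():
        if rng.random() < rate:
            value = child[name] + rng.gauss(0, 0.1 * (hi - lo))
            value = min(hi, max(lo, value))
            child[name] = round(value) if name == "num_topics" else value
    return child


def tournament(scored: List[Tuple[float, Genome]], rng: random.Random, k: int = 3) -> Genome:
    """Pick the fittest of k randomly sampled candidates."""
    return max(rng.sample(scored, k), key=lambda pair: pair[0])[1]


def genetic_search(
    fitness: Callable[[Genome], float],
    pop_size: int = 20,
    generations: int = 15,
    seed: int = 0,
) -> Tuple[Genome, float, List[float]]:
    """Return the best genome, its score, and the best score per generation."""
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(pop_size)]
    history: List[float] = []
    for _ in range(generations):
        scored = sorted(((fitness(g), g) for g in population),
                        key=lambda pair: pair[0], reverse=True)
        history.append(scored[0][0])
        elite = [g for _, g in scored[:2]]  # elitism: keep the two best unchanged
        children = [
            mutate(crossover(tournament(scored, rng), tournament(scored, rng), rng), rng)
            for _ in range(pop_size - len(elite))
        ]
        population = elite + children
    best = max(population, key=fitness)
    return best, fitness(best), history


def toy_fitness(genome: Genome) -> float:
    """Placeholder objective with an arbitrary peak; NOT a real LDA score."""
    return -(
        ((genome["num_topics"] - 20) / 45) ** 2
        + (genome["alpha"] - 0.3) ** 2
        + (genome["beta"] - 0.05) ** 2
    )


if __name__ == "__main__":
    best, score, history = genetic_search(toy_fitness, seed=1)
    print(best, score)
```

Because the two best candidates are carried over unchanged, the best score per generation never decreases, which makes the search easy to monitor. The choice of fitness matters more than the GA details: optimizing perplexity and optimizing coherence can favor different topic counts.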

If this is right

Empirical researchers gain a reusable corpus for measuring how RE challenges evolve over time.
Educators obtain concrete topic lists for designing targeted training on high-difficulty areas.
Developers of AI assistance tools receive labeled examples and difficulty signals for training and evaluation.
Tool builders can prioritize support for memory and firmware analysis based on the observed unresolved rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same collection method could be applied to other narrow software-engineering domains to produce comparable difficulty maps.
Difficulty signals derived from unanswered rates could be tested as predictors of which RE questions would benefit most from automated help.
The topic structure offers a starting point for defining benchmark tasks that future RE tools must handle.

Load-bearing premise

That the combination of LDA with genetic-algorithm tuning and subsequent manual labeling yields 23 topics that faithfully reflect the actual distribution of challenges in the collected posts.

What would settle it

A fresh run of the same modeling pipeline on the identical post collection that produces a markedly different set of topics or that shows low human agreement on the manual labels.
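One way to quantify "markedly different" is to compare the top words of each topic across two runs. The sketch below is an assumption about how such a check could look, not a procedure taken from the paper: each topic is reduced to its set of top words, each topic in the first run is matched to its most similar topic in the second run by Jaccard similarity, and the matches are averaged. The example topics are invented.

```python
from typing import List, Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two word sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def run_stability(run_a: Sequence[Sequence[str]], run_b: Sequence[Sequence[str]]) -> float:
    """Mean, over topics in run_a, of the best Jaccard match found in run_b.

    Each run is a list of topics; each topic is a list of its top words.
    The matching is greedy and not one-to-one, so the score is asymmetric.
    """
    sets_b: List[Set[str]] = [set(topic) for topic in run_b]
    best_matches = [max(jaccard(set(topic), other) for other in sets_b) for topic in run_a]
    return sum(best_matches) / len(best_matches)


if __name__ == "__main__":
    # Invented top-word lists for two hypothetical runs.
    first = [["debug", "breakpoint", "gdb"], ["firmware", "flash", "uart"]]
    second = [["flash", "firmware", "uart"], ["debug", "gdb", "ida"]]
    print(run_stability(first, second))  # 0.75
```

A score near 1.0 means every topic in the first run has a near-identical counterpart in the second; low scores suggest the topic set depends heavily on the random seed.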

Original abstract

Reverse engineering (RE) is a critical activity in software engineering and cybersecurity, supporting tasks such as malware analysis, vulnerability discovery, legacy system maintenance, and firmware inspection. Despite its importance, there is limited empirical understanding of the challenges, topics, and knowledge gaps faced by RE practitioners in real-world settings, and no publicly available dataset has systematically captured RE discussions from developer Q&A forums. In this paper, we present REStack, a large-scale dataset of RE discussions collected from Stack Overflow and the dedicated Reverse Engineering Stack Exchange site. The dataset comprises over 12,000 RE-related posts spanning more than 15 years. Using Latent Dirichlet Allocation (LDA) with Genetic Algorithm (GA)-based hyperparameter optimization, followed by manual topic labeling, we identify 23 semantically coherent RE topics grouped into six high-level thematic categories. The dataset is further enriched with metadata and difficulty indicators derived from community interaction signals, such as unanswered rates and response times. Our analysis reveals that RE discussions are predominantly practical and task-oriented, with strong emphasis on debugging, decompilation, and system-level analysis, while topics related to memory, firmware, and file format analysis exhibit high difficulty and unresolved rates. Beyond empirical characterization, REStack provides a reusable resource for empirical studies, educational research, and the development and evaluation of AI- and LLM-based developer assistance tools for RE. By releasing the dataset and accompanying scripts, this work aims to facilitate reproducible research and advance data-driven support for RE practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

REStack releases a sizable new dataset of RE discussions that fills a clear gap, but the topic modeling validation is too thin to fully support the difficulty and category claims.


REStack is mainly a data release: over 12,000 posts from Stack Overflow and the dedicated Reverse Engineering site, spanning more than 15 years, with added difficulty signals from community metrics. That alone addresses a real absence of large public RE discussion corpora.

The work applies LDA with GA-tuned hyperparameters to extract 23 topics, then manually groups them into six themes. The abstract reports that practical debugging and decompilation dominate while memory, firmware, and file-format topics show higher unresolved rates. Releasing the dataset plus scripts is the right step for this kind of paper.

The collection scale and the practitioner focus are the parts that hold up without extra assumptions. The analysis direction also matches what one would expect from RE forums.

The weaker part is the topic characterization itself. The methods rely on LDA plus manual labeling, yet the abstract gives no coherence scores, no inter-rater agreement for the labels, and no stability checks across runs. The post-hoc grouping into six categories adds another layer of judgment. Without those numbers it is hard to treat the 23-topic breakdown or the difficulty rankings as firmly grounded. Exact collection queries and filtering criteria are also not described, which limits immediate reuse.

This is the sort of resource paper that empirical SE and security researchers will want to cite for downstream studies or tool evaluation. It deserves peer review so the pipeline details and validation can be checked; the dataset contribution is substantial enough to justify the effort even if the current analysis needs tightening.

Referee Report

3 major / 2 minor

Summary. The paper presents REStack, a dataset of over 12,000 reverse engineering (RE) posts collected from Stack Overflow and the Reverse Engineering Stack Exchange site spanning more than 15 years. It applies Latent Dirichlet Allocation (LDA) with Genetic Algorithm (GA) hyperparameter optimization, followed by manual labeling, to derive 23 topics grouped into six thematic categories. The work enriches the dataset with metadata and difficulty indicators (e.g., unanswered rates, response times). It reports that RE discussions are predominantly practical and task-oriented (emphasizing debugging, decompilation, and system-level analysis), while memory, firmware, and file-format topics show high difficulty and unresolved rates. The dataset and scripts are released to support empirical studies, education, and AI/LLM tool development for RE.

Significance. If the topic characterization holds, the release of a large-scale, publicly available RE discussion dataset with derived difficulty signals constitutes a reusable resource for empirical SE research, educational analysis, and benchmarking of AI assistance tools. The paper's emphasis on reproducibility via released scripts and data is a clear strength.

major comments (3)

[Methods] Methods (Topic Modeling and Labeling subsection): No coherence metrics (NPMI, C_V, or similar), topic stability across random seeds, or ablation on the GA objective function are reported for the 23 topics. This directly undermines the claim that the topics are 'semantically coherent' and the downstream grouping into six categories plus difficulty/unresolved-rate analysis.
[Methods] Methods (Data Collection subsection): Exact search queries, filtering criteria, inclusion/exclusion rules, and post-selection validation steps used to obtain the 12,000+ posts are not specified. This affects both reproducibility of the core dataset and the representativeness of the analyzed RE discussions.
[Methods] Methods (Labeling process): No inter-rater agreement statistics (e.g., Cohen's kappa or percentage agreement) are provided for the manual labeling of topics and assignment to the six high-level categories. Without this, the subjectivity concern in the post-hoc grouping cannot be assessed.
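For readers unfamiliar with the coherence metric named in the first comment, the following is a simplified sketch of NPMI. It is an illustration under stated assumptions, not the paper's evaluation code: co-occurrence is counted at the whole-document level rather than with a sliding window, pairs that never co-occur score -1, pairs that appear together in every document score 1, and the tiny corpus is invented.

```python
import math
from itertools import combinations
from typing import List, Sequence


def npmi_coherence(top_words: Sequence[str], documents: Sequence[Sequence[str]]) -> float:
    """Average NPMI over all pairs of a topic's top words.

    Probabilities are document frequencies: P(w) is the share of documents
    containing w, and P(wi, wj) the share containing both words.
    Requires at least two top words and at least one document.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words: str) -> float:
        return sum(1 for doc in doc_sets if all(w in doc for w in words)) / n_docs

    scores: List[float] = []
    for wi, wj in combinations(top_words, 2):
        p_joint = prob(wi, wj)
        if p_joint == 0.0:
            scores.append(-1.0)  # convention: never co-occur
        elif p_joint == 1.0:
            scores.append(1.0)   # convention: co-occur in every document
        else:
            pmi = math.log(p_joint / (prob(wi) * prob(wj)))
            scores.append(pmi / -math.log(p_joint))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Invented, already-tokenized posts.
    corpus = [
        ["gdb", "breakpoint"],
        ["gdb", "breakpoint"],
        ["firmware", "uart"],
        ["firmware", "uart"],
    ]
    print(npmi_coherence(["gdb", "breakpoint"], corpus))  # close to 1.0
    print(npmi_coherence(["gdb", "firmware"], corpus))    # -1.0
```

Scores range from -1 to 1. Reporting the per-topic value for all 23 topics, rather than only the mean, would let readers see which topics carry the coherence claim and which do not.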

minor comments (2)

[Abstract] Abstract and Introduction: The claim of 'no publicly available dataset' should be qualified with a brief comparison to any prior RE-related corpora (even if smaller or narrower) to strengthen novelty positioning.
[Results] Results section: When reporting unresolved rates and response times per topic, include the raw counts or denominators alongside percentages for transparency.
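The reporting style requested in the second minor comment can be sketched as follows. The field names (`topic`, `answer_count`, `hours_to_first_answer`) and the sample posts are hypothetical and do not come from the REStack schema; the point is only that each percentage is emitted next to its numerator and denominator.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Optional


def topic_difficulty(posts: List[dict]) -> Dict[str, dict]:
    """Per-topic unanswered rate and median response time, with raw counts."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for post in posts:
        groups[post["topic"]].append(post)

    summary: Dict[str, dict] = {}
    for topic, items in groups.items():
        unanswered = sum(1 for p in items if p["answer_count"] == 0)
        delays = [p["hours_to_first_answer"] for p in items
                  if p["hours_to_first_answer"] is not None]
        med: Optional[float] = median(delays) if delays else None
        summary[topic] = {
            "n_posts": len(items),            # denominator
            "n_unanswered": unanswered,       # numerator
            "pct_unanswered": 100.0 * unanswered / len(items),
            "median_hours_to_first_answer": med,
        }
    return summary


if __name__ == "__main__":
    # Invented records using the hypothetical field names above.
    sample = [
        {"topic": "firmware", "answer_count": 0, "hours_to_first_answer": None},
        {"topic": "firmware", "answer_count": 0, "hours_to_first_answer": None},
        {"topic": "firmware", "answer_count": 1, "hours_to_first_answer": 3.0},
        {"topic": "firmware", "answer_count": 2, "hours_to_first_answer": 9.0},
        {"topic": "debugging", "answer_count": 1, "hours_to_first_answer": 1.0},
        {"topic": "debugging", "answer_count": 3, "hours_to_first_answer": 2.0},
    ]
    for topic, row in topic_difficulty(sample).items():
        print(topic, row)
```

Keeping the counts matters most for small topics, where a high unanswered percentage may rest on only a handful of posts.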

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications and committing to revisions that enhance the manuscript's methodological transparency and reproducibility.


Referee: [Methods] Methods (Topic Modeling and Labeling subsection): No coherence metrics (NPMI, C_V, or similar), topic stability across random seeds, or ablation on the GA objective function are reported for the 23 topics. This directly undermines the claim that the topics are 'semantically coherent' and the downstream grouping into six categories plus difficulty/unresolved-rate analysis.

Authors: We agree that quantitative validation metrics would strengthen the presentation. In the revised manuscript we will report NPMI and C_V coherence scores for the final 23-topic model. We will also include a stability analysis by re-running LDA across multiple random seeds and reporting the average pairwise Jaccard similarity of the topics' top-word sets. For the GA component we will specify the objective function (perplexity) and note that no ablation was performed due to computational cost; if space allows we will add a brief sensitivity check on key GA parameters. revision: yes
Referee: [Methods] Methods (Data Collection subsection): Exact search queries, filtering criteria, inclusion/exclusion rules, and post-selection validation steps used to obtain the 12,000+ posts are not specified. This affects both reproducibility of the core dataset and the representativeness of the analyzed RE discussions.

Authors: We acknowledge that precise collection details are essential. The posts were obtained via the Stack Exchange Data Explorer using tag-based filters ('reverse-engineering' on SO and the dedicated site) combined with keyword matching in titles and bodies. In the revision we will list the exact SQL queries, the date range, minimum score/post-length thresholds, and the manual sampling procedure used to verify relevance. This will enable full replication of the 12k+ post set. revision: yes
Referee: [Methods] Methods (Labeling process): No inter-rater agreement statistics (e.g., Cohen's kappa or percentage agreement) are provided for the manual labeling of topics and assignment to the six high-level categories. Without this, the subjectivity concern in the post-hoc grouping cannot be assessed.

Authors: The topic-to-category assignment was performed jointly by the author team with iterative discussion to reach consensus. We will add a dedicated paragraph reporting the agreement process: percentage agreement on the final groupings and, where multiple independent passes were feasible, Cohen's kappa. This will directly address concerns about subjectivity in the six-category taxonomy. revision: yes
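The kappa statistic mentioned in the third response corrects raw agreement for the agreement two raters would reach by chance. The sketch below computes it for two raters assigning one category per item; the rater labels are invented, and returning 1.0 when both raters use a single identical category throughout is a convention chosen here to avoid division by zero.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if expected == 1.0:
        return 1.0  # convention: both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Invented category assignments for four topics.
    first = ["memory", "memory", "debugging", "debugging"]
    second = ["memory", "debugging", "debugging", "debugging"]
    print(cohens_kappa(first, second))  # 0.5
```

In this toy case the raters agree on three of four items (75 percent), yet kappa is only 0.5 because half of that agreement is expected by chance. This is why percentage agreement alone, which the response also proposes, tends to overstate reliability.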

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and topic modeling are self-contained


The paper collects external Stack Exchange posts, applies standard LDA with GA hyperparameter search, performs manual labeling, and computes difficulty metrics from community signals. No equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness claims appear in the provided text. The central claims rest on the collected data and topic outputs rather than reducing to inputs by definition. This is the expected outcome for a dataset paper using off-the-shelf methods.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The contribution rests on the representativeness of Stack Exchange posts as a proxy for real RE practice and on the validity of LDA for discovering coherent topics in technical forum text.

free parameters (2)

Number of topics = 23: chosen after GA-optimized LDA and manual review
LDA hyperparameters: optimized via genetic algorithm; specific values not stated in the abstract

axioms (2)

Domain assumption: Stack Exchange posts accurately reflect real-world RE practitioner challenges and knowledge gaps. This is the core premise for treating the collected posts as representative data.
Domain assumption: LDA with GA tuning yields semantically meaningful and coherent topics in technical Q&A text. This justifies the identification of the 23 topics and their subsequent manual labeling.



Reference graph

Works this paper leans on

30 extracted references · 15 canonical work pages

[1] A. Ahmad, D. Costa, K. Badran, R. Abdalkareem, and E. Shihab. 2020. Challenges in Chatbot Development: A Study of Stack Overflow Posts. In Proceedings of the 17th International Conference on Mining Software Repositories. https://doi.org/10.1145/3379597.3387472

[2] S. Ahmed and M. Bagherzadeh. 2018. What Do Concurrency Developers Ask About? A Large-Scale Study Using Stack Overflow. In Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM ’18). 1–10. https://doi.org/10.1145/3239235.3239524

[3] H. Alibrahim and S. Ludwig. 2021. Hyperparameter Optimization: Comparing Genetic Algorithm against Grid Search and Bayesian Optimization. In Proceedings of the 2021 IEEE Congress on Evolutionary Computation (CEC). 1551–1559. https://doi.org/10.1109/CEC45853.2021.9504761

[4] M. Bagherzadeh and R. Khatchadourian. 2019. Going big: A Large-scale Study on What Big Data Developers Ask. In Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2019). 432–442. https://doi.org/10.1145/3338906.3338939

[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet Allocation. J. Mach. Learn. Res. 3 (2003), 993–1022.

[6] Norman Cliff. 1993. Dominance Statistics: Ordinal Analyses to Answer Ordinal Questions. Psychological Bulletin 114, 3 (1993), 494–509. https://doi.org/10.1037/0033-2909.114.3.494

[7] Jacob Cohen. 1988. Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Lawrence Erlbaum Associates, Hillsdale, NJ.

[8] Olive Jean Dunn. 1964. Multiple Comparisons Using Rank Sums. Technometrics 6, 3 (1964), 241–252. https://doi.org/10.2307/1266041

[9] Gensim. 2025. https://radimrehurek.com/gensim/. accessed March, 2026

[10] J. Holland. 1992. Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence. MIT Press 1, 1 (1992), 1–228. https://doi.org/10.7551/mitpress/1090.001.0001

[11] Introduction to card sorting. 2025. https://www.optimalworkshop.com/101guides/card-sorting-101/introduction-to-card-sorting. accessed March, 2026

[12] William H. Kruskal and W. Allen Wallis. 1952. Use of Ranks in One-Criterion Variance Analysis. J. Amer. Statist. Assoc. 47, 260 (1952), 583–621. https://doi.org/10.2307/2280779

[13] J. Richard Landis and Gary G. Koch. 1977. The Measurement of Observer Agreement for Categorical Data. Biometrics 33, 1 (1977), 159–174. https://doi.org/10.2307/2529310

[14] C. Li, J. Jiang, Y. Zhao, R. Li, E. Wang, X. Zhang, and K. Zhao. 2021. Genetic Algorithm-Based Hyper-Parameters Optimization for Transfer Convolutional Neural Network. arXiv preprint (2021). https://doi.org/10.48550/arXiv.2103.03875 arXiv:2103.03875

[15] Natural Language Toolkit (NLTK) Stop Words. 2025. https://gist.github.com/sebleier/554280. accessed March, 2026

[16] M. Openja, B. Adams, and F. Khomh. 2020. Analysis of Modern Release Engineering Topics: A Large-Scale Study Using Stack Overflow. In Proceedings of the 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). 104–114. https://doi.org/10.1109/ICSME46990.2020.00020

[17] A. Ouni, I. Saidani, E. Alomar, and M. Mkaouer. 2023. An Empirical Study on Continuous Integration Trends, Topics and Challenges in Stack Overflow. In Proceedings of the 27th International Conference on Evaluation and Assessment in Software Engineering. 141–151. https://doi.org/10.1145/3593434.3593485

[18] A. Peruma, S. Simmons, E. A. AlOmar, C. D. Newman, M. W. Mkaouer, and A. Ouni. 2022. How do i refactor this? An empirical study on refactoring trends and topics in Stack Overflow. Empirical Softw. Engg. 27, 1 (2022), 1–43. https://doi.org/10.1007/s10664-021-10045-x

[19] Replication Package. 2026. https://figshare.com/s/a1eca7ed23c8f3b1fe78. accessed March, 2026

[20] Reverse Engineering Site. 2025. https://reverseengineering.stackexchange.com//. accessed March, 2026

[21] M. Röder, A. Both, and A. Hinneburg. 2015. Exploring the space of topic coherence measures. In Proceedings of the ACM International Conference on Web Search and Data Mining (WSDM). ACM, 399–408.

[22] J. Romano, J. Kromrey, J. Coraggio, and J. Skowronek. 2006. Appropriate Statistics for Ordinal Level Data: Should We Really Be Using t-test and Cohen’s d for Evaluating Group Differences on the NSSE and Other Surveys? Annual Meeting of the Florida Association of Institutional Research (2006), 1–33.

[23] C. Rosen and E. Shihab. 2016. What Are Mobile Developers Asking About? A Large-Scale Study Using Stack Overflow. Empirical Software Engineering 21 (2016), 1192–1223. https://doi.org/10.1007/s10664-015-9379-3

[24] I. Saidani, A. Ouni, and M. Mkaouer. 2022. Improving the prediction of continuous integration build failures using deep learning. Automated Software Engineering 29, 1 (2022), 1–61.

[25] Charles Spearman. 1904. The Proof and Measurement of Association Between Two Things. The American Journal of Psychology 15, 1 (1904), 72–101. https://doi.org/10.2307/1412159

[26] Stack Exchange. 2025. https://stackexchange.com/. accessed March, 2026

[27] Stack Overflow Site. 2025. https://stackoverflow.com/. accessed March, 2026

[28] G. Uddin, F. Sabir, Y. Guéhéneuc, O. Alam, and F. Khomh. 2021. An Empirical Study of IoT Topics in IoT Developer Discussions on Stack Overflow. Empirical Software Engineering 26, 6 (2021). https://doi.org/10.1007/s10664-021-10021-5

[29] L. Yang and A. Shami. 2020. On Hyperparameter Optimization of Machine Learning Algorithms: Theory and Practice. Neurocomputing 415 (2020), 295–316. https://doi.org/10.1016/j.neucom.2020.07.061

[30] X. Yang, D. Lo, X. Xia, Z. Wan, and J. Sun. 2016. What security questions do developers ask? a large-scale study of stack overflow posts. Journal of Computer Science and Technology 31 (2016), 910–924.
