
arxiv: 2606.05493 · v1 · pith:WZZNOJ3Xnew · submitted 2026-06-03 · 💻 cs.SE

REStack: A Large-Scale Dataset of Reverse Engineering Discussions from Stack Exchange

Pith reviewed 2026-06-28 04:51 UTC · model grok-4.3

classification 💻 cs.SE
keywords reverse engineering · stack exchange · dataset · topic modeling · software engineering · empirical analysis · cybersecurity

The pith

REStack dataset shows reverse engineering discussions focus on practical debugging and decompilation while memory and firmware topics remain difficult to resolve.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper gathers more than 12,000 reverse engineering posts from Stack Overflow and the Reverse Engineering Stack Exchange site, spanning more than 15 years. It applies topic modeling to sort the posts into 23 topics grouped under six themes and adds difficulty signals such as unanswered rates and response times. The resulting picture is that most discussion stays task-oriented around debugging, decompilation, and system analysis, yet memory, firmware, and file-format topics show notably higher rates of unresolved questions. The authors release the full dataset and scripts so others can run empirical studies, build teaching materials, or train assistance tools on real RE questions.

Core claim

A collection of over 12,000 RE posts can be reduced to 23 coherent topics that demonstrate RE practice is overwhelmingly practical and task-oriented, with debugging, decompilation, and system-level analysis dominating, while memory, firmware, and file-format analysis show elevated difficulty and unresolved rates.

What carries the argument

The REStack dataset, assembled by collecting posts from two Stack Exchange sites and then processed with LDA topic modeling whose hyperparameters were tuned by genetic algorithm, followed by manual labeling into six thematic categories and enrichment with community-derived difficulty metadata.
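To make the "tuned by genetic algorithm" step concrete, here is a minimal sketch of a GA searching over LDA-style hyperparameters. It is not the authors' implementation: the search bounds, the population settings, and `toy_fitness` are all assumptions made for illustration. In a real pipeline the fitness function would train an LDA model and return a quality score for it; here it is a cheap stand-in so the search loop can run on its own.

```python
import random

# Hypothetical search space (not taken from the paper): number of topics
# and the two Dirichlet priors of an LDA model.
BOUNDS = {"num_topics": (5, 50), "alpha": (0.01, 1.0), "eta": (0.01, 1.0)}


def random_individual(rng):
    """Draw one candidate hyperparameter setting uniformly within BOUNDS."""
    return {
        "num_topics": rng.randint(*BOUNDS["num_topics"]),
        "alpha": rng.uniform(*BOUNDS["alpha"]),
        "eta": rng.uniform(*BOUNDS["eta"]),
    }


def crossover(a, b, rng):
    """Uniform crossover: each gene is copied from one parent at random."""
    return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in a}


def mutate(ind, rng, rate=0.2):
    """With probability `rate`, resample a gene uniformly within its bounds."""
    out = dict(ind)
    for key, (lo, hi) in BOUNDS.items():
        if rng.random() < rate:
            out[key] = rng.randint(lo, hi) if key == "num_topics" else rng.uniform(lo, hi)
    return out


def genetic_search(fitness, pop_size=20, generations=15, seed=0):
    """Elitist GA: keep the top quarter, refill with mutated offspring."""
    rng = random.Random(seed)
    population = [random_individual(rng) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        elite = ranked[: pop_size // 4]
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            children.append(mutate(crossover(p1, p2, rng), rng))
        population = elite + children
    return max(population, key=fitness)


def toy_fitness(ind):
    """Stand-in objective with an arbitrary optimum; a real run would
    train LDA with these settings and score the resulting model."""
    return -(
        abs(ind["num_topics"] - 20) / 45
        + abs(ind["alpha"] - 0.1)
        + abs(ind["eta"] - 0.1)
    )


best = genetic_search(toy_fitness)
print(best)
```

Because the elite is carried over unchanged, the best fitness never decreases between generations. The choice of objective matters more than the search loop itself, which is why the referee report below asks what the GA actually optimized.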

If this is right

  • Empirical researchers gain a reusable corpus for measuring how RE challenges evolve over time.
  • Educators obtain concrete topic lists for designing targeted training on high-difficulty areas.
  • Developers of AI assistance tools receive labeled examples and difficulty signals for training and evaluation.
  • Tool builders can prioritize support for memory and firmware analysis based on the observed unresolved rates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same collection method could be applied to other narrow software-engineering domains to produce comparable difficulty maps.
  • Difficulty signals derived from unanswered rates could be tested as predictors of which RE questions would benefit most from automated help.
  • The topic structure offers a starting point for defining benchmark tasks that future RE tools must handle.

Load-bearing premise

That the combination of LDA with genetic-algorithm tuning and subsequent manual labeling yields 23 topics that faithfully reflect the actual distribution of challenges in the collected posts.

What would settle it

A fresh run of the same modeling pipeline on the identical post collection that produces a markedly different set of topics or that shows low human agreement on the manual labels.
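One way such a re-run could be compared against the original is sketched below: represent each topic by its set of top words and measure Jaccard overlap between runs. Topic indices are arbitrary across LDA runs, so each topic is matched to its most similar counterpart before averaging. The function names and the toy word lists are illustrative assumptions, not material from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity of two word collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def best_match_stability(run_a, run_b):
    """Mean, over topics in run_a, of the best Jaccard overlap with any
    topic in run_b. Values near 1 suggest the runs found similar topics."""
    if not run_a or not run_b:
        raise ValueError("both runs need at least one topic")
    return sum(max(jaccard(t, u) for u in run_b) for t in run_a) / len(run_a)


# Hypothetical top-word lists from two runs (note the different topic order).
run_1 = [
    ["debugger", "breakpoint", "gdb", "step"],
    ["firmware", "flash", "uart", "dump"],
]
run_2 = [
    ["firmware", "flash", "dump", "binwalk"],
    ["debugger", "breakpoint", "gdb", "trace"],
]

score = best_match_stability(run_1, run_2)
print(round(score, 3))  # 0.6
```

This score is asymmetric (it is computed from the perspective of `run_a`), so a careful comparison would report it in both directions or use a one-to-one matching.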

Original abstract

Reverse engineering (RE) is a critical activity in software engineering and cybersecurity, supporting tasks such as malware analysis, vulnerability discovery, legacy system maintenance, and firmware inspection. Despite its importance, there is limited empirical understanding of the challenges, topics, and knowledge gaps faced by RE practitioners in real-world settings, and no publicly available dataset has systematically captured RE discussions from developer Q&A forums. In this paper, we present REStack, a large-scale dataset of RE discussions collected from Stack Overflow and the dedicated Reverse Engineering Stack Exchange site. The dataset comprises over 12,000 RE-related posts spanning more than 15 years. Using Latent Dirichlet Allocation (LDA) with Genetic Algorithm (GA)-based hyperparameter optimization, followed by manual topic labeling, we identify 23 semantically coherent RE topics grouped into six high-level thematic categories. The dataset is further enriched with metadata and difficulty indicators derived from community interaction signals, such as unanswered rates and response times. Our analysis reveals that RE discussions are predominantly practical and task-oriented, with strong emphasis on debugging, decompilation, and system-level analysis, while topics related to memory, firmware, and file format analysis exhibit high difficulty and unresolved rates. Beyond empirical characterization, REStack provides a reusable resource for empirical studies, educational research, and the development and evaluation of AI- and LLM-based developer assistance tools for RE. By releasing the dataset and accompanying scripts, this work aims to facilitate reproducible research and advance data-driven support for RE practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents REStack, a dataset of over 12,000 reverse engineering (RE) posts collected from Stack Overflow and the Reverse Engineering Stack Exchange site spanning more than 15 years. It applies Latent Dirichlet Allocation (LDA) with Genetic Algorithm (GA) hyperparameter optimization, followed by manual labeling, to derive 23 topics grouped into six thematic categories. The work enriches the dataset with metadata and difficulty indicators (e.g., unanswered rates, response times). It finds that RE discussions are predominantly practical and task-oriented (emphasizing debugging, decompilation, and system-level analysis), while memory, firmware, and file-format topics show high difficulty and unresolved rates. It releases the dataset and scripts to support empirical studies, education, and AI/LLM tool development for RE.

Significance. If the topic characterization holds, the release of a large-scale, publicly available RE discussion dataset with derived difficulty signals constitutes a reusable resource for empirical SE research, educational analysis, and benchmarking of AI assistance tools. The paper's emphasis on reproducibility via released scripts and data is a clear strength.

major comments (3)
  1. [Methods] Methods (Topic Modeling and Labeling subsection): No coherence metrics (NPMI, C_V, or similar), topic stability across random seeds, or ablation on the GA objective function are reported for the 23 topics. This directly undermines the claim that the topics are 'semantically coherent' and the downstream grouping into six categories plus difficulty/unresolved-rate analysis.
  2. [Methods] Methods (Data Collection subsection): Exact search queries, filtering criteria, inclusion/exclusion rules, and post-selection validation steps used to obtain the 12,000+ posts are not specified. This affects both reproducibility of the core dataset and the representativeness of the analyzed RE discussions.
  3. [Methods] Methods (Labeling process): No inter-rater agreement statistics (e.g., Cohen's kappa or percentage agreement) are provided for the manual labeling of topics and assignment to the six high-level categories. Without this, the subjectivity concern in the post-hoc grouping cannot be assessed.
minor comments (2)
  1. [Abstract] Abstract and Introduction: The claim of 'no publicly available dataset' should be qualified with a brief comparison to any prior RE-related corpora (even if smaller or narrower) to strengthen novelty positioning.
  2. [Results] Results section: When reporting unresolved rates and response times per topic, include the raw counts or denominators alongside percentages for transparency.
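The second minor comment, reporting raw counts alongside percentages, can be illustrated with a short sketch. The record layout (`topic`, `answer_count`) is a hypothetical schema, not the released dataset's actual field names, and "unanswered" is assumed here to mean a question with zero answers.

```python
from collections import defaultdict


def unanswered_by_topic(posts):
    """Return {topic: (unanswered_count, total_count, rate)} so the
    denominator behind each percentage stays visible."""
    totals = defaultdict(int)
    unanswered = defaultdict(int)
    for post in posts:
        totals[post["topic"]] += 1
        if post["answer_count"] == 0:
            unanswered[post["topic"]] += 1
    return {
        topic: (unanswered[topic], totals[topic], unanswered[topic] / totals[topic])
        for topic in totals
    }


# Hypothetical sample records for illustration only.
sample_posts = [
    {"topic": "firmware", "answer_count": 0},
    {"topic": "firmware", "answer_count": 0},
    {"topic": "firmware", "answer_count": 2},
    {"topic": "debugging", "answer_count": 1},
    {"topic": "debugging", "answer_count": 0},
    {"topic": "debugging", "answer_count": 3},
    {"topic": "debugging", "answer_count": 1},
]

for topic, (missing, total, rate) in sorted(unanswered_by_topic(sample_posts).items()):
    print(f"{topic}: {missing}/{total} unanswered ({rate:.1%})")
```

Printing `2/3` next to `66.7%` lets a reader see at a glance when a high rate rests on very few posts.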

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications and committing to revisions that enhance the manuscript's methodological transparency and reproducibility.

  1. Referee: [Methods] Methods (Topic Modeling and Labeling subsection): No coherence metrics (NPMI, C_V, or similar), topic stability across random seeds, or ablation on the GA objective function are reported for the 23 topics. This directly undermines the claim that the topics are 'semantically coherent' and the downstream grouping into six categories plus difficulty/unresolved-rate analysis.

    Authors: We agree that quantitative validation metrics would strengthen the presentation. In the revised manuscript we will report NPMI and C_V coherence scores for the final 23-topic model. We will also include a stability analysis by re-running LDA across multiple random seeds and reporting the average pairwise Jaccard similarity of the topics' top-word sets. For the GA component we will specify the objective function (perplexity) and note that no ablation was performed due to computational cost; if space allows we will add a brief sensitivity check on key GA parameters. revision: yes

  2. Referee: [Methods] Methods (Data Collection subsection): Exact search queries, filtering criteria, inclusion/exclusion rules, and post-selection validation steps used to obtain the 12,000+ posts are not specified. This affects both reproducibility of the core dataset and the representativeness of the analyzed RE discussions.

    Authors: We acknowledge that precise collection details are essential. The posts were obtained via the Stack Exchange Data Explorer using tag-based filters ('reverse-engineering' on SO and the dedicated site) combined with keyword matching in titles and bodies. In the revision we will list the exact SQL queries, the date range, minimum score/post-length thresholds, and the manual sampling procedure used to verify relevance. This will enable full replication of the 12k+ post set. revision: yes

  3. Referee: [Methods] Methods (Labeling process): No inter-rater agreement statistics (e.g., Cohen's kappa or percentage agreement) are provided for the manual labeling of topics and assignment to the six high-level categories. Without this, the subjectivity concern in the post-hoc grouping cannot be assessed.

    Authors: The topic-to-category assignment was performed jointly by the author team with iterative discussion to reach consensus. We will add a dedicated paragraph reporting the agreement process: percentage agreement on the final groupings and, where multiple independent passes were feasible, Cohen's kappa. This will directly address concerns about subjectivity in the six-category taxonomy. revision: yes
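For the agreement statistic promised in the third response, a minimal sketch of Cohen's kappa for two raters is given below. The category names are placeholders, since the paper's six categories are not named on this page, and the two label lists are invented for the example.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
    p_e is the agreement expected from each rater's label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical category assignments for six topics by two raters.
rater_a = ["tools", "tools", "memory", "memory", "firmware", "firmware"]
rater_b = ["tools", "tools", "memory", "firmware", "firmware", "firmware"]

kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 2))  # 0.75
```

Kappa only makes sense when the raters labeled independently. Labels reached through joint discussion, as the authors describe, would need a separate independent pass before this statistic can be computed.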

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and topic modeling are self-contained


The paper collects external Stack Exchange posts, applies standard LDA with GA hyperparameter search, performs manual labeling, and computes difficulty metrics from community signals. No equations, fitted parameters renamed as predictions, self-citation chains, or uniqueness claims appear in the provided text. The central claims rest on the collected data and topic outputs rather than reducing to inputs by definition. This is the expected outcome for a dataset paper using off-the-shelf methods.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The contribution rests on the representativeness of Stack Exchange posts as a proxy for real RE practice and on the validity of LDA for discovering coherent topics in technical forum text.

free parameters (2)
  • Number of topics = 23
    Chosen as 23 after GA-optimized LDA and manual review
  • LDA hyperparameters
    Optimized via genetic algorithm; specific values not stated in abstract
axioms (2)
  • domain assumption Stack Exchange posts accurately reflect real-world RE practitioner challenges and knowledge gaps
    Core premise for treating the collected posts as representative data
  • domain assumption LDA with GA tuning yields semantically meaningful and coherent topics in technical Q&A text
    Justifies the identification of the 23 topics and their subsequent manual labeling
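The second axiom is the kind of assumption a coherence score can probe. The sketch below computes a simplified NPMI coherence for one topic using document-level co-occurrence. It is an illustration under stated assumptions, not the metric as the authors would compute it: windowed variants are common in practice, and the handling of the two edge cases (never co-occurring, always co-occurring) is a convention chosen here.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Mean NPMI over all pairs of a topic's top words.

    Probabilities are document frequencies: the fraction of documents
    containing a word, or both words of a pair.
    """
    if len(top_words) < 2 or not documents:
        raise ValueError("need at least two top words and one document")
    docs = [set(doc) for doc in documents]
    n = len(docs)

    def prob(*words):
        return sum(all(w in doc for w in words) for doc in docs) / n

    scores = []
    for w_i, w_j in combinations(top_words, 2):
        p_ij = prob(w_i, w_j)
        if p_ij == 0.0:
            scores.append(-1.0)  # convention: pair never co-occurs
        elif p_ij == 1.0:
            scores.append(0.0)   # convention: pair is in every document
        else:
            pmi = math.log(p_ij / (prob(w_i) * prob(w_j)))
            scores.append(pmi / -math.log(p_ij))
    return sum(scores) / len(scores)


# Hypothetical tokenized posts for illustration only.
toy_docs = [
    ["gdb", "breakpoint"],
    ["gdb", "breakpoint"],
    ["firmware", "uart"],
    ["firmware", "uart"],
]

coherent = npmi_coherence(["gdb", "breakpoint"], toy_docs)
mixed = npmi_coherence(["gdb", "firmware"], toy_docs)
print(coherent, mixed)
```

Words that always appear together score near 1 and words that never co-occur score -1, so a topic mixing unrelated vocabulary is pulled toward the negative end.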

pith-pipeline@v0.9.1-grok · 5803 in / 1467 out tokens · 34674 ms · 2026-06-28T04:51:15.223436+00:00 · methodology

