An Improved Quantum Software Challenges Classification Approach using Transfer Learning and Explainable AI
Pith reviewed 2026-05-18 13:59 UTC · model grok-4.3
The pith
Transformer models classify quantum software challenges from Stack Overflow posts at 95 percent accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors extract 2829 questions using quantum-related tags and apply content analysis and grounded theory to define six challenge categories. They annotate the posts through human review plus ChatGPT validation to create ground truth. Fine-tuned BERT and DistilBERT models then reach an average 95 percent accuracy in assigning posts to these categories, while fine-tuned feedforward, convolutional, and LSTM networks reach 89, 86, and 84 percent. The transformer method thus holds a six-point edge over the strongest baseline while operating directly on the unaltered discussions, and SHAP explanations reveal the linguistic features that drive each classification.
What carries the argument
Fine-tuned transformer models such as BERT and DistilBERT paired with SHAP value explanations that map word patterns in the posts to one of six categories: Tooling, Theoretical, Learning, Conceptual, Errors, and API Usage.
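The SHAP half of this machinery rests on Shapley values, which split a model's score among its input features. The following is a minimal sketch that computes them exactly for a toy scoring function; the token weights, the interaction bonus, and the function names are hypothetical illustrations, not the paper's model and not the `shap` library API.

```python
from itertools import combinations
from math import factorial

# Hypothetical toy "model": a score for the Errors category given the set of
# tokens present in a post. The weights and interaction term are invented.
WEIGHTS = {"traceback": 2.0, "qiskit": 0.5, "install": 1.0}

def errors_score(tokens):
    """Additive token weights plus a bonus when two tokens co-occur."""
    score = sum(WEIGHTS.get(t, 0.0) for t in tokens)
    if "traceback" in tokens and "install" in tokens:
        score += 1.0  # interaction term shared by the two tokens
    return score

def shapley_values(tokens, value_fn):
    """Exact Shapley value of each token, enumerating every coalition."""
    n = len(tokens)
    result = {}
    for tok in tokens:
        others = [t for t in tokens if t != tok]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that excludes `tok`.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                without = frozenset(subset)
                total += weight * (value_fn(without | {tok}) - value_fn(without))
        result[tok] = total
    return result

post_tokens = ["traceback", "qiskit", "install"]
attributions = shapley_values(post_tokens, errors_score)
```

The attributions sum to the score of the full post minus the score of an empty one, which is the property that lets per-token explanations add up to the prediction. Exact enumeration grows exponentially with the number of tokens, so practical tools rely on approximations.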
If this is right
- Quantum vendors and forum operators can apply the trained models to automatically organize and surface discussions for quicker developer access.
- The six-category taxonomy provides a stable reference frame for tracking which challenges appear most often over time.
- SHAP explanations allow maintainers to inspect and adjust the model when particular linguistic cues produce unexpected assignments.
- The same transfer-learning pipeline can be reused on other specialized software domains once comparable labeled discussions exist.
Where Pith is reading between the lines
- Deploying the classifier inside live forums could generate suggested tags or related threads while a question is being written.
- Periodic retraining on newer posts would test whether the original six categories continue to capture emerging issues as quantum tools mature.
- Comparing predictions against direct developer surveys would measure how well the model aligns with self-reported challenge priorities.
Load-bearing premise
The six categories identified through content analysis and grounded theory together with the human-plus-ChatGPT annotations supply an accurate and unbiased ground-truth labeling of the extracted posts.
What would settle it
A new round of independent annotations by a different set of quantum developers on a held-out collection of posts that produces substantially different category assignments or drops transformer accuracy below 90 percent.
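One way to run this check is to score the classifier against the new annotations and put a bootstrap confidence interval around the accuracy, then see where the 90 percent threshold falls. The sketch below assumes predictions and new labels are available as parallel lists; the data are synthetic placeholders, not results from the paper, and the function names are hypothetical.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of posts whose predicted category matches the annotation."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy, resampling posts with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Synthetic placeholder data: 200 re-annotated posts, 180 predicted correctly.
y_true = ["Errors"] * 100 + ["Tooling"] * 100
y_pred = ["Errors"] * 90 + ["Tooling"] * 10 + ["Tooling"] * 90 + ["Errors"] * 10

point = accuracy(y_true, y_pred)
low, high = bootstrap_ci(y_true, y_pred)
```

If the whole interval sits below 0.90 on the re-annotated posts, the headline accuracy would be hard to defend; if the interval straddles 0.90, a larger held-out sample would be needed to decide.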
Original abstract
Quantum Software Engineering (QSE) is a research area practiced by tech firms. Quantum developers face challenges in optimizing quantum computing and QSE concepts. They use Stack Overflow (SO) to discuss challenges and label posts with specialized quantum tags, which often refer to technical aspects rather than developer posts. Categorizing questions based on quantum concepts can help identify frequent QSE challenges. We conducted studies to classify questions into various challenges. We extracted 2829 questions from Q&A platforms using quantum-related tags. Posts were analyzed to identify frequent challenges and develop a novel grounded theory. Challenges include Tooling, Theoretical, Learning, Conceptual, Errors, and API Usage. Through content analysis and grounded theory, discussions were annotated with common challenges to develop a ground truth dataset. ChatGPT validated human annotations and resolved disagreements. Fine-tuned transformer algorithms, including BERT, DistilBERT, and RoBERTa, classified discussions into common challenges. We achieved an average accuracy of 95% with BERT DistilBERT, compared to fine-tuned Deep and Machine Learning (D&ML) classifiers, including Feedforward Neural Networks (FNN), Convolutional Neural Networks (CNN), and Long Short-Term Memory networks (LSTM), which achieved accuracies of 89%, 86%, and 84%, respectively. The Transformer-based approach outperforms the D&ML-based approach with a 6% increase in accuracy by processing actual discussions, i.e., without data augmentation. We applied SHAP (SHapley Additive exPlanations) for model interpretability, revealing how linguistic features drive predictions and enhancing transparency in classification. These findings can help quantum vendors and forums better organize discussions for improved access and readability. However,empirical evaluation studies with actual developers and vendors are needed.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript extracts 2829 quantum-tagged posts from Q&A platforms, derives six challenge categories (Tooling, Theoretical, Learning, Conceptual, Errors, API Usage) via content analysis and grounded theory, creates ground-truth labels through human annotation validated and reconciled by ChatGPT, and fine-tunes transformer models (BERT, DistilBERT, RoBERTa) to classify posts. It reports an average accuracy of 95% for the transformer models on the unaugmented data, 6 percentage points above the strongest fine-tuned baseline (FNN at 89%, with CNN at 86% and LSTM at 84%), and applies SHAP to interpret linguistic features driving predictions.
Significance. If the reported performance holds under proper validation, the work provides a practical, interpretable tool for organizing developer discussions on quantum software challenges, which could benefit forums and vendors. Credit is due for evaluating on real (non-augmented) data and for incorporating SHAP explanations to enhance transparency. The central performance claim, however, rests on the unverified reliability of the human-plus-ChatGPT labels.
major comments (3)
- [Section 3] Section 3 (Annotation and Ground-Truth Creation): No inter-annotator agreement statistics (Cohen’s kappa, Fleiss’ kappa, or raw agreement percentages) or breakdown of disagreement rates resolved by ChatGPT are reported. Because the 95% accuracy and 6% lift are measured against these labels, the absence of agreement metrics directly undermines interpretability of the headline result.
- [Section 4] Section 4 (Experimental Setup and Evaluation): The manuscript provides no information on the train/test split ratio, class-balance statistics across the six categories in the 2829-post corpus, or any statistical significance testing (e.g., McNemar’s test or bootstrap confidence intervals) for the reported accuracy differences. These omissions are load-bearing for the claim that transformers outperform the D&ML baselines.
- [Section 5] Section 5 (Results): The comparison to FNN, CNN, and LSTM baselines does not specify whether identical hyperparameter search, early-stopping criteria, or data-preprocessing steps were applied, making it difficult to attribute the 6% gap solely to model architecture rather than experimental differences.
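The agreement statistic requested in the first major comment can be illustrated with a short sketch of Cohen's kappa for two annotators. The annotation lists below are hypothetical, not data from the paper, and the function name is an assumption. With more than two annotators, Fleiss' kappa would be the analogous measure.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both annotators pick the same category
    # if each labeled independently according to their own marginals.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical category throughout
    return (observed - expected) / (1 - expected)

# Hypothetical annotations for ten posts (not data from the paper).
annotator_1 = ["Errors", "Errors", "Tooling", "Tooling", "Learning",
               "Learning", "API Usage", "API Usage", "Conceptual", "Theoretical"]
annotator_2 = ["Errors", "Errors", "Tooling", "Errors", "Learning",
               "Conceptual", "API Usage", "API Usage", "Conceptual", "Theoretical"]

kappa = cohens_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 8 of 10 posts, and chance agreement from their marginals is 0.17, so kappa is (0.80 - 0.17) / (1 - 0.17), roughly 0.76. Reporting the statistic before and after ChatGPT reconciliation would show how much the reconciliation step changed the labels.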
minor comments (2)
- [Abstract] Abstract: Typo in the final sentence (“However,empirical” should read “However, empirical”).
- [Section 3] The description of how the six categories were finalized from the grounded-theory analysis could be clarified with a brief example of a post-to-category mapping.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments point by point below and outline the revisions we will implement to enhance the manuscript's rigor and transparency.
Point-by-point responses
Referee: [Section 3] Section 3 (Annotation and Ground-Truth Creation): No inter-annotator agreement statistics (Cohen’s kappa, Fleiss’ kappa, or raw agreement percentages) or breakdown of disagreement rates resolved by ChatGPT are reported. Because the 95% accuracy and 6% lift are measured against these labels, the absence of agreement metrics directly undermines interpretability of the headline result.
Authors: We acknowledge the importance of reporting inter-annotator agreement to validate the ground-truth labels. Our process involved multiple human annotators performing content analysis, with ChatGPT used to validate annotations and resolve any disagreements. In the revised version, we will include inter-annotator agreement statistics such as Cohen's kappa and Fleiss' kappa where applicable, along with a detailed breakdown of disagreement rates and how they were resolved by ChatGPT. This addition will strengthen the credibility of our labeled dataset. revision: yes
Referee: [Section 4] Section 4 (Experimental Setup and Evaluation): The manuscript provides no information on the train/test split ratio, class-balance statistics across the six categories in the 2829-post corpus, or any statistical significance testing (e.g., McNemar’s test or bootstrap confidence intervals) for the reported accuracy differences. These omissions are load-bearing for the claim that transformers outperform the D&ML baselines.
Authors: We agree that these details are crucial for assessing the robustness of our results. We will update Section 4 to specify the train/test split ratio employed, present the class-balance statistics for the six categories in the corpus of 2829 posts, and incorporate statistical significance testing, including McNemar's test or bootstrap confidence intervals, to confirm the significance of the performance improvements. revision: yes
Referee: [Section 5] Section 5 (Results): The comparison to FNN, CNN, and LSTM baselines does not specify whether identical hyperparameter search, early-stopping criteria, or data-preprocessing steps were applied, making it difficult to attribute the 6% gap solely to model architecture rather than experimental differences.
Authors: We appreciate this observation regarding experimental fairness. All models were evaluated under consistent conditions, including the same data preprocessing pipeline and hyperparameter tuning approaches. In the revised manuscript, we will provide explicit details on the hyperparameter search, early-stopping criteria, and preprocessing steps applied uniformly to the transformer models and the D&ML baselines (FNN, CNN, LSTM). This will clarify that the performance differences are due to the model architectures. revision: yes
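The significance testing promised in the second response could take the form of an exact McNemar test on paired predictions from two models over the same test posts. The sketch below is one possible formulation under that assumption; the labels and predictions are hypothetical placeholders, not the paper's outputs.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on posts where only one model is correct."""
    # b: model A right, model B wrong; c: model A wrong, model B right.
    b = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    # Under the null hypothesis, discordant pairs split 50/50: Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Hypothetical test set of 20 posts, all truly labeled "Errors".
y_true = ["Errors"] * 20
pred_transformer = ["Errors"] * 8 + ["Tooling"] + ["Errors"] * 11
pred_baseline = ["Tooling"] * 8 + ["Errors"] * 12

b, c, p_value = mcnemar_exact(y_true, pred_transformer, pred_baseline)
```

The test uses only the posts on which the two models differ in correctness, so it needs per-post predictions from both models on an identical test split, which ties it directly to the split details requested in the same comment.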
Circularity Check
No circularity: standard supervised classification on independently derived labels
full rationale
The paper first extracts 2829 posts via quantum tags, then applies content analysis and grounded theory to identify six challenge categories and produce human-plus-ChatGPT annotations as ground truth. Transformer models (BERT, DistilBERT) are subsequently fine-tuned on this labeled set and evaluated on held-out posts to report 95% accuracy versus 89/86/84% for FNN/CNN/LSTM baselines. No equations, fitted parameters, or self-citations reduce the reported accuracies to the input labels by construction; the taxonomy precedes modeling and the performance numbers measure generalization on unseen data rather than tautological reproduction of the annotation process.
Axiom & Free-Parameter Ledger
free parameters (1)
- fine-tuning hyperparameters
axioms (2)
- domain assumption: Pre-trained transformer models can be successfully fine-tuned for multi-class text classification on domain-specific forum posts
- ad hoc to paper: ChatGPT can reliably resolve annotation disagreements and validate human labels for QSE challenge categories
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "We achieved an average accuracy of 95% with BERT DistilBERT... The Transformer-based approach outperforms the D&ML-based approach with a 6% increase in accuracy by processing actual discussions, i.e., without data augmentation."
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  "Challenges include Tooling, Theoretical, Learning, Conceptual, Errors, and API Usage... ChatGPT validated human annotations"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.