RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow
Pith reviewed 2026-05-08 11:23 UTC · model grok-4.3
The pith
RAG-Reflect predicts which Stack Overflow comments trigger code edits by combining retrieval with self-reflection on pre-generated rules, reaching 0.78 F1 without task-specific training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RAG-Reflect achieves Precision = 0.81, Recall = 0.74, and F1 = 0.78 on the SOUP benchmark for valid comment-edit prediction. It does so with a three-stage workflow: a one-time Interpretation module extracts validation rules from the knowledge base; at runtime the system retrieves contextual examples, reasons about comment-edit causality, and reflects against the pre-established rules. The resulting performance exceeds traditional baselines and approaches that of fine-tuned models without any task-specific training.
What carries the argument
The three-stage runtime workflow built on a one-time Interpretation module that generates validation rules, then performs retrieval of contextual examples, reasoning about causality, and reflection on decisions using those rules.
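The workflow above can be sketched in a few lines. Everything below is illustrative: keyword rules, word-overlap retrieval, and majority voting stand in for the paper's LLM-driven Interpretation, retrieval, reasoning, and reflection stages; none of these names come from the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Example:
    comment: str
    edited: bool  # did this comment trigger a code edit?

def interpret(knowledge_base):
    """One-time Interpretation pass: derive simple validation rules.

    Stand-in for the paper's LLM rule generation: here a "rule" is just
    a keyword that appears in edit-triggering comments.
    """
    rules = set()
    for ex in knowledge_base:
        if ex.edited:
            rules.update(w for w in ex.comment.lower().split() if len(w) > 4)
    return rules

def retrieve(comment, knowledge_base, k=2):
    """Stage 1: fetch the k examples with the most word overlap."""
    words = set(comment.lower().split())
    scored = sorted(knowledge_base,
                    key=lambda ex: -len(words & set(ex.comment.lower().split())))
    return scored[:k]

def reason(comment, neighbours):
    """Stage 2: majority vote over retrieved neighbours (LLM stand-in)."""
    return sum(ex.edited for ex in neighbours) > len(neighbours) / 2

def reflect(comment, decision, rules):
    """Stage 3: overturn a negative decision if a validation rule fires."""
    if not decision and any(r in comment.lower() for r in rules):
        return True
    return decision

def predict(comment, knowledge_base, rules):
    neighbours = retrieve(comment, knowledge_base)
    return reflect(comment, reason(comment, neighbours), rules)
```

The point of the sketch is the control flow: rules are computed once from the knowledge base, then every prediction runs retrieve, reason, and reflect in sequence, with reflection able to override the reasoning stage.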
If this is right
- Retrieval and reflection each contribute measurable gains, as shown by ablation studies that isolate their effects.
- The same modular structure can be reused across different maintenance tasks without retraining the underlying language model.
- Performance remains competitive with fine-tuned models while eliminating the cost of task-specific data collection and training.
- Traditional supervised baselines such as logistic regression and gradient boosting are outperformed on the same benchmark.
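For concreteness, the kind of supervised baseline the paper compares against can be written in plain Python as a bag-of-words logistic regression; the training data here is a hypothetical stand-in for SOUP, not the paper's features.

```python
import math
from collections import Counter

def featurize(comment, vocab):
    """Bag-of-words count vector over a fixed vocabulary."""
    counts = Counter(comment.lower().split())
    return [counts[w] for w in vocab]

def train(comments, labels, epochs=200, lr=0.5):
    """Fit logistic regression by plain stochastic gradient descent."""
    vocab = sorted({w for c in comments for w in c.lower().split()})
    w = [0.0] * len(vocab)
    b = 0.0
    for _ in range(epochs):
        for c, y in zip(comments, labels):
            x = featurize(c, vocab)
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return vocab, w, b

def classify(comment, vocab, w, b):
    """Predict whether a comment triggers an edit (decision at z > 0)."""
    x = featurize(comment, vocab)
    return b + sum(wi * xi for wi, xi in zip(w, x)) > 0
```

Such a baseline needs labeled training data for every new task, which is exactly the cost the agentic pipeline claims to avoid.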
Where Pith is reading between the lines
- The one-time rule generation step could be refreshed periodically on new data to handle evolving comment patterns without full retraining.
- Similar retrieve-reflect loops might reduce annotation needs for other software-engineering tasks that involve informal user feedback.
- If the reflection rules prove brittle on new platforms, the framework offers a clear place to insert lightweight rule updates rather than new labeled datasets.
Load-bearing premise
The validation rules produced by the single Interpretation pass stay effective and general across the full range of comments and code seen later, and the LLM can apply them correctly during reflection without further adaptation.
What would settle it
A fresh collection of Stack Overflow posts where the fixed validation rules cause the reflection stage to misclassify comment-edit causality, dropping the F1 score below 0.70.
Original abstract
User comments on online programming platforms such as Stack Overflow play a vital role in maintaining the correctness and relevance of shared code examples. However, the majority of comments express gratitude or clarification, while only a small fraction highlight actionable issues that drive meaningful edits. This paper demonstrates how agentic AI principles can revolutionize software maintenance tasks by presenting RAG-Reflect, a modular framework that achieves fine-tuned-level performance for valid comment-edit prediction without task-specific training. Valid Comment-Edit Prediction (VCP) is the task of determining whether a user comment directly triggered a subsequent code edit. The framework integrates large language models (LLMs) with retrieval-augmented reasoning and self-reflection mechanisms. RAG-Reflect operates through a three-stage runtime workflow built on a one-time pattern analysis phase. During initialization, an Interpretation module analyzes the knowledge base to generate validation rules. At inference time, the system (1) retrieves contextual examples, (2) reasons about comment-edit causality, and (3) reflects on decisions using the pre-established rules. We evaluate RAG-Reflect on the publicly available SOUP benchmark, achieving Precision = 0.81, Recall = 0.74, and F1 = 0.78, outperforming traditional baselines (e.g., Logistic Regression, XGBoost, different prompting techniques) and closely approaching the performance of fine-tuned models (F1 = 0.773) without retraining. Our ablation and stage-level analyses show that both retrieval and reflection modules substantially enhance performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RAG-Reflect, a modular agentic framework for Valid Comment-Edit Prediction (VCP) that combines a one-time Interpretation module (to derive validation rules from the full SOUP knowledge base) with a three-stage inference pipeline of retrieval, reasoning about comment-edit causality, and reflection against the pre-generated rules. It reports Precision=0.81, Recall=0.74, F1=0.78 on the public SOUP benchmark, outperforming traditional baselines (Logistic Regression, XGBoost, prompting) and approaching fine-tuned model performance (F1=0.773) without task-specific training; ablations attribute gains to the retrieval and reflection stages.
Significance. If the performance holds under a properly isolated evaluation, the result would be significant for software engineering applications: it shows that a one-time pattern-analysis plus agentic retrieve-reason-reflect loop can reach near-supervised F1 on comment-driven code maintenance using only an off-the-shelf LLM and a public benchmark, without retraining. The explicit ablation and stage-level analyses, together with the public SOUP benchmark and comparisons against both classical ML and fine-tuned baselines, provide concrete, reproducible evidence for the contribution of the reflection mechanism.
major comments (1)
- [Initialization phase (one-time Interpretation module)] Initialization / one-time pattern analysis phase: the Interpretation module is described as analyzing the entire SOUP knowledge base to produce validation rules that are then used at inference time in the reflection stage. No train/test split is mentioned prior to rule generation, so the rules may encode statistical regularities or edge cases from the evaluation data; this directly threatens the central claim that the F1=0.78 is achieved 'without task-specific training' and generalizes. A minimal fix is to regenerate rules exclusively on training folds and re-report all metrics and ablations.
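The referee's minimal fix can be made concrete with a fold-restricted rule-generation loop. This is a sketch of the proposed protocol, not the authors' code; `rule_fn` is a hypothetical stand-in for the Interpretation module.

```python
def k_folds(items, k=5):
    """Yield (train, test) splits without external libraries."""
    for i in range(k):
        test = items[i::k]  # every k-th item, offset i
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test

def leakage_free_rules(dataset, rule_fn, k=5):
    """Regenerate validation rules per fold so that no evaluation item
    ever influences the rules it is later judged against."""
    folds = []
    for train, test in k_folds(dataset, k):
        rules = rule_fn(train)  # Interpretation pass on training data only
        folds.append((rules, test))
    return folds
```

Re-reporting all metrics and ablations under this protocol would directly test whether the F1 = 0.78 survives without any test-set influence on the rules.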
minor comments (2)
- [Evaluation] Evaluation section: the reported F1=0.78 vs. fine-tuned F1=0.773 difference is presented without statistical significance tests, confidence intervals, or error analysis; adding these would strengthen the claim that the agentic pipeline is competitive.
- [Ablation studies] The abstract and methods refer to 'ablation and stage-level analyses' but do not specify the exact metrics or controls used for the 'reflection helps' result; a table or figure with per-stage contributions would improve clarity.
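A paired bootstrap is one standard way to supply the significance test the first minor comment asks for: resample test items with replacement and count how often one system's F1 beats the other's. The labels and predictions below are toy data, not the paper's.

```python
import random

def f1(gold, pred):
    """F1 from boolean gold labels and predictions."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def bootstrap_f1_diff(gold, pred_a, pred_b, n=2000, seed=0):
    """Fraction of bootstrap resamples where system A's F1 beats B's."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        sample = [rng.randrange(len(gold)) for _ in gold]
        g = [gold[i] for i in sample]
        a = f1(g, [pred_a[i] for i in sample])
        b = f1(g, [pred_b[i] for i in sample])
        if a > b:
            wins += 1
    return wins / n
```

Reporting this win rate (or the bootstrap confidence interval of the F1 difference) would make the 0.78 vs. 0.773 comparison interpretable.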
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable comments on our manuscript. We particularly appreciate the identification of the potential issue with the one-time Interpretation module. Below, we provide a point-by-point response and outline the revisions we will make to address this concern.
Point-by-point responses
- Referee: [Initialization phase (one-time Interpretation module)] Initialization / one-time pattern analysis phase: the Interpretation module is described as analyzing the entire SOUP knowledge base to produce validation rules that are then used at inference time in the reflection stage. No train/test split is mentioned prior to rule generation, so the rules may encode statistical regularities or edge cases from the evaluation data; this directly threatens the central claim that the F1=0.78 is achieved 'without task-specific training' and generalizes. A minimal fix is to regenerate rules exclusively on training folds and re-report all metrics and ablations.
- Authors: We acknowledge the validity of this concern. The manuscript does not explicitly state that the Interpretation module's rule generation was restricted to training data, which could allow for unintended incorporation of test set information. We will revise the paper to implement the suggested minimal fix: regenerate the validation rules using only the training folds of the SOUP benchmark. All experiments, including the main results and ablations, will be re-run and reported with this change. This will strengthen our assertion that the approach achieves strong performance without task-specific training on unseen data.
- Revision: yes
Circularity Check
No circularity; empirical results on external public benchmark with independent baselines
Full rationale
The paper reports experimental performance (Precision 0.81, Recall 0.74, F1 0.78) of the RAG-Reflect framework on the publicly available SOUP benchmark, with comparisons to traditional baselines and fine-tuned models. The one-time Interpretation module generates validation rules from the knowledge base as part of the described workflow, but this is an empirical setup rather than a mathematical derivation or first-principles prediction that reduces to its inputs by construction. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would make the central performance claims equivalent to the inputs. The evaluation uses an external benchmark and reports direct measurements, making the results self-contained against independent references.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Large language models possess sufficient reasoning capability to determine comment-edit causality when supplied with retrieved examples and explicit validation rules.
Forward citations
Cited by 1 Pith paper
- PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents
  An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.