pith. machine review for the scientific record.

arxiv: 2604.22217 · v1 · submitted 2026-04-24 · 💻 cs.SE

Recognition: unknown

RAG-Reflect: Agentic Retrieval-Augmented Generation with Reflections for Comment-Driven Code Maintenance on Stack Overflow

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 11:23 UTC · model grok-4.3

classification 💻 cs.SE
keywords retrieval-augmented generation · code maintenance · Stack Overflow · valid comment-edit prediction · self-reflection · large language models · software engineering · agentic AI

The pith

RAG-Reflect predicts which Stack Overflow comments trigger code edits by combining retrieval with self-reflection on pre-generated rules, reaching 0.78 F1 without task-specific training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

User comments on Stack Overflow rarely drive code edits, yet identifying those that do supports ongoing maintenance of shared examples. RAG-Reflect addresses this by first running a one-time Interpretation module over a knowledge base to produce validation rules, then using those rules at inference time to guide a retrieve-reason-reflect cycle inside a large language model. On the SOUP benchmark the method records Precision 0.81, Recall 0.74 and F1 0.78, beating logistic regression, XGBoost and plain prompting while staying close to the 0.773 F1 of fine-tuned models. Ablation results show that both the retrieval and reflection stages add measurable value. The framework therefore demonstrates how agentic retrieval and reflection can substitute for retraining on narrow maintenance tasks.
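A minimal sketch of how such a retrieve-reason-reflect cycle could be wired together, assuming a generic llm.complete interface, a retriever.top_k lookup, and a list of pre-generated rule strings; the prompt wording and helper names are illustrative placeholders, not the authors' implementation.

    def predict_valid_edit(comment, code_before, code_after, retriever, rules, llm):
        # 1. Retrieval: pull similar labeled comment-edit examples from the knowledge base.
        examples = retriever.top_k(query=f"{comment}\n{code_before}", k=5)

        # 2. Reasoning: draft a decision about comment-edit causality.
        reasoning_prompt = (
            "Labeled examples:\n" + "\n".join(str(e) for e in examples)
            + f"\n\nComment: {comment}\nBefore: {code_before}\nAfter: {code_after}\n"
            "Did the comment trigger this edit? Answer VALID or INVALID with reasons."
        )
        draft = llm.complete(reasoning_prompt)

        # 3. Reflection: re-check the draft against the pre-generated validation rules.
        reflection_prompt = (
            "Validation rules:\n" + "\n".join(rules)
            + f"\n\nDraft decision:\n{draft}\n"
            "Revise the decision if it violates any rule. Output VALID or INVALID."
        )
        verdict = llm.complete(reflection_prompt).upper()
        # "INVALID" contains "VALID", so rule it out first.
        return "INVALID" not in verdict and "VALID" in verdict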

Core claim

RAG-Reflect achieves Precision = 0.81, Recall = 0.74, and F1 = 0.78 on the SOUP benchmark for valid comment-edit prediction by using a three-stage workflow: a one-time Interpretation module that extracts validation rules from the knowledge base, followed at runtime by retrieval of contextual examples, reasoning about comment-edit causality, and reflection against the pre-established rules; the resulting performance exceeds traditional baselines and approaches that of fine-tuned models without any task-specific training.
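The one-time Interpretation pass can be pictured as a single rule-mining sweep over labeled comment-edit pairs, cached before any inference happens. The batching scheme, prompt text, and llm interface below are assumptions made for illustration; the paper's actual rule format is not reproduced here.

    def interpret_knowledge_base(labeled_pairs, llm, batch_size=20):
        # labeled_pairs: iterable of (comment, edit, label) triples from the
        # knowledge base. Runs once; the returned rules are cached and reused
        # verbatim by the reflection stage at inference time.
        rules = []
        labeled_pairs = list(labeled_pairs)
        for start in range(0, len(labeled_pairs), batch_size):
            batch = labeled_pairs[start:start + batch_size]
            prompt = (
                "Each item pairs a Stack Overflow comment and a code edit with a "
                "label saying whether the comment triggered the edit.\n"
                + "\n".join(f"- comment: {c!r}; edit: {e!r}; label: {y}"
                            for c, e, y in batch)
                + "\nState general validation rules that separate valid from "
                "invalid comment-edit pairs, one rule per line."
            )
            rules.extend(line.strip() for line in llm.complete(prompt).splitlines()
                         if line.strip())
        return rules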

What carries the argument

The three-stage runtime workflow built on a one-time Interpretation module that generates validation rules, then performs retrieval of contextual examples, reasoning about causality, and reflection on decisions using those rules.

If this is right

  • Retrieval and reflection each contribute measurable gains, as shown by ablation studies that isolate their effects.
  • The same modular structure can be reused across different maintenance tasks without retraining the underlying language model.
  • Performance remains competitive with fine-tuned models while eliminating the cost of task-specific data collection and training.
  • Traditional supervised baselines such as logistic regression and gradient boosting are outperformed on the same benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The one-time rule generation step could be refreshed periodically on new data to handle evolving comment patterns without full retraining.
  • Similar retrieve-reflect loops might reduce annotation needs for other software-engineering tasks that involve informal user feedback.
  • If the reflection rules prove brittle on new platforms, the framework offers a clear place to insert lightweight rule updates rather than new labeled datasets.

Load-bearing premise

The validation rules produced by the single Interpretation pass stay effective and general across the full range of comments and code seen later, and the LLM can apply them correctly during reflection without further adaptation.

What would settle it

A fresh collection of Stack Overflow posts where the fixed validation rules cause the reflection stage to misclassify comment-edit causality, dropping the F1 score below 0.70.

Figures

Figures reproduced from arXiv: 2604.22217 by Alioune Ngom, Mehedi Hasan Shanto, Muhammad Asaduzzaman.

Figure 1: An example of a comment that triggered the edit to a Stack Overflow answer.
Figure 2: An example of a question and the accepted answer in Stack Overflow. A comment associated to the …
Figure 3: Motivating examples of comment-driven edits on Stack Overflow.
Figure 4: Agentic retrieval-augmented workflow for comment-driven code maintenance. The pipeline comprises …
Figure 5: Retrieval–Reasoning prompt used in RAG-Reflect. The model analyzes the comment–code pair with …
Figure 6: Rule-based Reflection prompt. The model validates causal consistency using explicit linguistic and …
Figure 7: Zero-shot prompt template used for the Valid Comment–Edit Prediction (VCP) task.
Figure 8: Comparison of Invalid and Valid F1-scores across multiple prompting strategies (Zero-Shot, 2-shot, …).
Figure 9: Failure-case analysis of RAG-Reflect. Semantic mismatch between comment and code change: a common failure occurs when the comment and the code modification appear lexically related but address different underlying issues, and the model fails to recognize that the patch fully implements the developer's intent. Example (PostId 6923710): the comment highlights a "Type mismatch" involving …
Figure 10: Prompt template used for the Automatic Post Update (APU) task (an extension to our RAG-Reflect).
Original abstract

User comments on online programming platforms such as Stack Overflow play a vital role in maintaining the correctness and relevance of shared code examples. However, the majority of comments express gratitude or clarification, while only a small fraction highlight actionable issues that drive meaningful edits. This paper demonstrates how agentic AI principles can revolutionize software maintenance tasks by presenting RAG-Reflect, a modular framework that achieves fine-tuned-level performance for valid comment-edit prediction without task-specific training. Valid Comment-Edit Prediction (VCP) is the task of determining whether a user comment directly triggered a subsequent code edit. The framework integrates large language models (LLMs) with retrieval-augmented reasoning and self-reflection mechanisms. RAG-Reflect operates through a three-stage runtime workflow built on a one-time pattern analysis phase. During initialization, an Interpretation module analyzes the knowledge base to generate validation rules. At inference time, the system (1) retrieves contextual examples, (2) reasons about comment-edit causality, and (3) reflects on decisions using the pre-established rules. We evaluate RAG-Reflect on the publicly available SOUP benchmark, achieving Precision = 0.81, Recall = 0.74, and F1 = 0.78, outperforming traditional baselines (e.g., Logistic Regression, XGBoost, different prompting techniques) and closely approaching the performance of fine-tuned models (F1 = 0.773) without retraining. Our ablation and stage-level analyses show that both retrieval and reflection modules substantially enhance performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces RAG-Reflect, a modular agentic framework for Valid Comment-Edit Prediction (VCP) that combines a one-time Interpretation module (to derive validation rules from the full SOUP knowledge base) with a three-stage inference pipeline of retrieval, reasoning about comment-edit causality, and reflection against the pre-generated rules. It reports Precision=0.81, Recall=0.74, F1=0.78 on the public SOUP benchmark, outperforming traditional baselines (Logistic Regression, XGBoost, prompting) and approaching fine-tuned model performance (F1=0.773) without task-specific training; ablations attribute gains to the retrieval and reflection stages.

Significance. If the performance holds under a properly isolated evaluation, the result would be significant for software engineering applications: it shows that a one-time pattern-analysis plus agentic retrieve-reason-reflect loop can reach near-supervised F1 on comment-driven code maintenance using only an off-the-shelf LLM and a public benchmark, without retraining. The explicit ablation and stage-level analyses, together with the public SOUP benchmark and comparisons against both classical ML and fine-tuned baselines, provide concrete, reproducible evidence for the contribution of the reflection mechanism.

major comments (1)
  1. [Initialization phase (one-time Interpretation module)] Initialization / one-time pattern analysis phase: the Interpretation module is described as analyzing the entire SOUP knowledge base to produce validation rules that are then used at inference time in the reflection stage. No train/test split is mentioned prior to rule generation, so the rules may encode statistical regularities or edge cases from the evaluation data; this directly threatens the central claim that the F1=0.78 is achieved 'without task-specific training' and generalizes. A minimal fix is to regenerate rules exclusively on training folds and re-report all metrics and ablations.
minor comments (2)
  1. [Evaluation] Evaluation section: the reported F1=0.78 vs. fine-tuned F1=0.773 difference is presented without statistical significance tests, confidence intervals, or error analysis; adding these would strengthen the claim that the agentic pipeline is competitive (a paired-bootstrap sketch of such a check follows this list).
  2. [Ablation studies] The abstract and methods refer to 'ablation and stage-level analyses' but do not specify the exact metrics or controls used for the 'reflection helps' result; a table or figure with per-stage contributions would improve clarity.
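One way to act on the significance point in minor comment 1 is a paired bootstrap over per-item test predictions, comparing RAG-Reflect against the fine-tuned baseline. This is a sketch of a standard check under assumed inputs (y_true and the two prediction arrays are placeholders), not an analysis the paper reports.

    import numpy as np
    from sklearn.metrics import f1_score

    def bootstrap_f1_gap(y_true, pred_a, pred_b, n_boot=10_000, seed=0):
        # Paired bootstrap: resample test items with replacement and recompute
        # the F1 gap each time. pred_a / pred_b are per-item binary predictions
        # from the two systems on the same test set (placeholder data).
        rng = np.random.default_rng(seed)
        y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
        gaps = []
        for _ in range(n_boot):
            idx = rng.integers(0, len(y_true), len(y_true))
            gaps.append(f1_score(y_true[idx], pred_a[idx])
                        - f1_score(y_true[idx], pred_b[idx]))
        lo, hi = np.percentile(gaps, [2.5, 97.5])
        return lo, hi  # a 95% interval excluding 0 would support the competitiveness claim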

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their thorough review and valuable comments on our manuscript. We particularly appreciate the identification of the potential issue with the one-time Interpretation module. Below, we provide a point-by-point response and outline the revisions we will make to address this concern.

Point-by-point responses
  1. Referee: [Initialization phase (one-time Interpretation module)] Initialization / one-time pattern analysis phase: the Interpretation module is described as analyzing the entire SOUP knowledge base to produce validation rules that are then used at inference time in the reflection stage. No train/test split is mentioned prior to rule generation, so the rules may encode statistical regularities or edge cases from the evaluation data; this directly threatens the central claim that the F1=0.78 is achieved 'without task-specific training' and generalizes. A minimal fix is to regenerate rules exclusively on training folds and re-report all metrics and ablations.

    Authors: We acknowledge the validity of this concern. The manuscript does not explicitly state that the Interpretation module's rule generation was restricted to training data, which could allow for unintended incorporation of test set information. We will revise the paper to implement the suggested minimal fix: regenerate the validation rules using only the training folds of the SOUP benchmark. All experiments, including the main results and ablations, will be re-run and reported with this change. This will strengthen our assertion that the approach achieves strong performance without task-specific training on unseen data. revision: yes
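A minimal sketch of the leakage-free protocol the rebuttal commits to: regenerate the validation rules from training folds only, then score the held-out fold. KFold and precision_recall_fscore_support are standard scikit-learn calls; interpret_knowledge_base and predict_valid_edit are the illustrative stand-ins sketched earlier on this page, not released code.

    from sklearn.model_selection import KFold
    from sklearn.metrics import precision_recall_fscore_support

    def evaluate_without_leakage(pairs, labels, retriever, llm, n_splits=5):
        # pairs: list of (comment, code_before, code_after); labels: 0/1 validity.
        scores = []
        kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        for train_idx, test_idx in kfold.split(pairs):
            # Rules are derived from training items only; ideally the retriever's
            # example pool would be restricted to the same items.
            train = [(pairs[i][0], (pairs[i][1], pairs[i][2]), labels[i])
                     for i in train_idx]
            rules = interpret_knowledge_base(train, llm)
            preds = [predict_valid_edit(pairs[i][0], pairs[i][1], pairs[i][2],
                                        retriever, rules, llm)
                     for i in test_idx]
            truth = [labels[i] for i in test_idx]
            scores.append(precision_recall_fscore_support(truth, preds,
                                                          average="binary"))
        return scores  # per-fold (precision, recall, F1, support)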

Circularity Check

0 steps flagged

No circularity; empirical results on external public benchmark with independent baselines

Full rationale

The paper reports experimental performance (Precision 0.81, Recall 0.74, F1 0.78) of the RAG-Reflect framework on the publicly available SOUP benchmark, with comparisons to traditional baselines and fine-tuned models. The one-time Interpretation module generates validation rules from the knowledge base as part of the described workflow, but this is an empirical setup rather than a mathematical derivation or first-principles prediction that reduces to its inputs by construction. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would make the central performance claims equivalent to the inputs. The evaluation uses an external benchmark and reports direct measurements, making the results self-contained against independent references.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLMs can perform reliable causal reasoning about comment-edit pairs when augmented with retrieved examples and pre-generated validation rules; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: Large language models possess sufficient reasoning capability to determine comment-edit causality when supplied with retrieved examples and explicit validation rules.
    Invoked throughout the description of the reasoning and reflection stages.

pith-pipeline@v0.9.0 · 5586 in / 1434 out tokens · 26289 ms · 2026-05-08T11:23:31.785332+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents

cs.CL · 2026-05 · unverdicted · novelty 5.0

    An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.

Reference graph

Works this paper leans on

60 extracted references · 45 canonical work pages · cited by 1 Pith paper · 10 internal anchors

  1. [1]

    Deepak Bhaskar Acharya, Karthigeyan Kuppan, and B. Divya. 2025. Agentic AI: Autonomous Intelligence for Complex Goals—A Comprehensive Survey. IEEE Access 13 (2025), 18912–18936. doi:10.1109/ACCESS.2025.3532853

  2. [2]

    Muhammad Azeem Akbar, Arif Ali Khan, Muhammad Hamza, Abdullah Ghaffar, and Kari Smolander. 2025. Agentic AI in Software Engineering: Practitioner Perspectives Across the Software Development Life Cycle. SSRN preprint 5520159 (2025). doi:10.2139/ssrn.5520159

  3. [3]

    Ashton Anderson, Daniel Huttenlocher, Jon Kleinberg, and Jure Leskovec. 2013. Steering User Behavior with Badges. In Proc. of the 22nd Int. World Wide Web Conf. (WWW). 95–106. doi:10.1145/2488388.2488398

  4. [4]

    Darko Androcec. 2020. Machine learning methods for toxic comment classification: a systematic review. Acta Universitatis Sapientiae, Informatica 12, 2 (2020), 205–216

  5. [5]

    Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. 2023. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection. arXiv preprint arXiv:2310.11511 (2023). doi:10.48550/arXiv.2310.11511

  6. [6]

    Sebastian Baltes, Lorik Dumani, Christoph Treude, and Stephan Diehl. 2018. SOTorrent: reconstructing and analyzing the evolution of stack overflow posts. In Proceedings of the 15th International Conference on Mining Software Repositories (MSR ’18). ACM, 319–330. doi:10.1145/3196398.3196430

  7. [7]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, et al. 2020. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems (NeurIPS) 33 (2020), 1877–1901

  8. [8]

    Chunyang Chen, Zhenchang Xing, and Yang Liu. 2017. By the Community & For the Community: A Deep Learning Approach to Assist Collaborative Editing in Q&A Sites. Proc. ACM Hum.-Comput. Interact. (CSCW) 1, 2 (2017). doi:10.1145/3134667

  9. [9]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374 (2021). https://arxiv.org/abs/2107.03374

  10. [10]

    Xinyun Chen et al. 2018. Tree-to-Tree Neural Networks for Program Translation. Advances in Neural Information Processing Systems (2018), 2547–2557

  11. [11]

    Jacob Cohen. 1968. Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit. Psychological Bulletin 70 (1968), 213–220. https://api.semanticscholar.org/CorpusID:29694079

  12. [12]

    Manaal Faruqui and Dipanjan Das. 2018. Identifying Well-formed Natural Language Questions. In Proc. of the 2018 Conf. on Empirical Methods in Natural Language Processing (EMNLP). 798–803. doi:10.18653/v1/D18-1093

  13. [13]

    Ahmed E. Hassan, Hao Li, Dayi Lin, Bram Adams, Tse-Hsun Chen, Yutaro Kashiwa, and Dong Qiu. 2025. Agentic Software Engineering: Foundational Pillars and a Research Roadmap. arXiv:2509.06216 [cs.SE]. doi:10.48550/arXiv.2509.06216

  14. [14]

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685 [cs.CL] https://arxiv.org/abs/2106.09685

  15. [15]

    Meiqun Hu, Ee-Peng Lim, Aixin Sun, Hady W. Lauw, and Ba-Quy Vuong. 2007. Measuring Article Quality in Wikipedia: Models and Evaluation. In Proc. of the 16th ACM Conf. on Information and Knowledge Management (CIKM). 243–252. doi:10.1145/1321440.1321476

  16. [16]

    Shuyang Jiang, Yuhao Wang, and Yu Wang. 2023. SelfEvolve: A code evolution framework via large language models. arXiv preprint arXiv:2306.02907. https://arxiv.org/abs/2306.02907

  17. [17]

    Xianhao Jin and Francisco Servant. 2019. What Edits Are Done on the Highly Answered Questions in Stack Overflow? An Empirical Study. In Proc. of the 16th IEEE/ACM Int. Conf. on Mining Software Repositories (MSR). 225–229. doi:10.1109/MSR.2019.00045

  18. [18]

    Harsh Kajla, Jatin Hooda, Gajanand Saini, et al. 2020. Classification of online toxic comments using machine learning algorithms. In 2020 4th International Conference on Intelligent Computing and Control Systems (ICICCS). IEEE, 1119–1123

  19. [19]

    Eirini Kalliamvakou, Daniela Damian, Kelly Blincoe, Leif Singer, and Daniel M. German. 2015. Open Source-Style Collaborative Development Practices in Commercial Projects Using GitHub. In Proc. of the 37th Int. Conf. on Software Engineering (ICSE). 574–585. doi:10.1109/ICSE.2015.74

  20. [20]

    Aniket Kittur and Robert E. Kraut. 2008. Harnessing the Wisdom of Crowds in Wikipedia: Quality Through Coordination. In Proc. of the ACM 2008 Conf. on Human Factors in Computing Systems (CHI). 37–46. doi:10.1145/1357054.1357062

  21. [21]

    Patrick Lewis, Boyd Oguz, and et al. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems. 9459–9474

  22. [22]

    Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems 33 (NeurIPS)

  23. [23]

    Guo Li, Haiyi Zhu, Tun Lu, Xianghua Ding, and Ning Gu. 2015. Is It Good to Be Like Wikipedia?: Exploring the Trade-offs of Introducing Collaborative Editing Model to Q&A Sites. In Proc. of the 18th ACM Conf. on Computer Supported Cooperative Work & Social Computing (CSCW). 1080–1091. doi:10.1145/2675133.2675155

  24. [24]

    Hao Li, Haoxiang Zhang, and Ahmed E. Hassan. 2025. The Rise of AI Teammates in Software Engineering 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering. arXiv preprint arXiv:2507.15003. doi:10.48550/arXiv.2507.15003

  25. [25]

    Lei Li, Daqing He, Wei Jeng, Spencer Goodwin, and Chengzhi Zhang. 2015. Answer Quality Characteristics and Prediction on an Academic Q&A Site: A Case Study on ResearchGate. In Companion Proc. of the 24th Int. World Wide Web Conf. (WWW). 1453–1458. doi:10.1145/2740908.2742129

  26. [26]

    Chin-Yew Lin. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Workshop on Text Summarization Branches Out. 74–81

  27. [27]

    Yubo Mai, Zhipeng Gao, Haoye Wang, Tingting Bi, Xing Hu, Xin Xia, and Jianling Sun. 2025. Towards Better Answers: Automated Stack Overflow Post Updating. In Proceedings of the 47th IEEE/ACM International Conference on Software Engineering (ICSE). 591–603. doi:10.1109/ICSE55347.2025.00024

  28. [28]

    Saikat Mondal, Gias Uddin, and Chanchal K. Roy. 2023. Automatic Prediction of Rejected Edits in Stack Overflow. Empirical Software Engineering 28, 1 (2023). doi:10.1007/s10664-022-10242-2

  29. [29]

    Andrew Ng. 2024. Practical Tips for Building Agentic AI Workflows (Modules M1–M5). Lecture Series, DeepLearning.AI. Available via the Agentic AI course on DeepLearning.AI

  30. [30]

    Meethun Panda. 2025. Agentic RAG: Redefining Retrieval-Augmented Generation for Adaptive Intelligence. International Research Journal of Engineering and Technology (IRJET) 12, 1 (2025), 731–739

  31. [31]

    Md Rafiuzzaman Parvez, Earl T. Barr, and Collin McMillan. 2021. REDCODER: Retrieval-augmented code generation and summarization. arXiv preprint arXiv:2108.11601. https://arxiv.org/abs/2108.11601

  32. [32]

    M. Masudur Rahman and Chanchal K. Roy. 2015. An Insight into the Unresolved Questions at Stack Overflow. In Proc. of the 12th IEEE/ACM Working Conf. on Mining Software Repositories (MSR). 426–429. doi:10.1109/MSR.2015.55

  33. [33]

    Shuo Ren, Daya Guo, Shuai Deng, Sifei Lu, et al. 2020. CodeBLEU: A Method for Automatic Evaluation of Code Synthesis. In Neural Information Processing Systems Workshop on Evaluation for Code Generation

  34. [34]

    Abhik Roychoudhury. 2025. Agentic AI for Software: Thoughts from the Software Engineering Community. arXiv preprint arXiv:2508.17343 (2025). doi:10.48550/arXiv.2508.17343

  35. [35]

    Abhik Roychoudhury, Corina Păsăreanu, Michael Pradel, and Baishakhi Ray. 2025. Agentic AI Software Engineers: Programming with Trust. arXiv preprint arXiv:2502.13767 (2025). doi:10.48550/arXiv.2502.13767

  36. [36]

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Guido Lample, et al. 2023. Code Llama: Open foundation models for code. arXiv preprint arXiv:2308.12950. https://arxiv.org/abs/2308.12950

  37. [37]

    Mikalai Sabel. 2007. Structuring Wiki Revision History. In Proc. of the 2007 Int. Symposium on Wikis (WikiSym). 125–130. doi:10.1145/1296951.1296972

  38. [38]

    Revati Sharma and Meetkumar Patel. 2018. Toxic comment classification using neural networks and machine learning. Int. Adv. Res. J. Sci. Eng. Technol. 5, 9 (2018)

  39. [39]

    Mohammad Sadegh Sheikhaei, Yuan Tian, and Shaowei Wang. 2023. A study of update request comments in Stack Overflow answer posts. Journal of Systems and Software 198 (2023), 111590. doi:10.1016/j.jss.2022.111590

  40. [40]

    Noah Shinn, Shunyu Yao, and Karthik Narasimhan. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. arXiv preprint arXiv:2303.11366 (2023). https://arxiv.org/abs/2303.11366

  41. [41]

    Aditi Singh, Abul Ehtesham, Saket Kumar, and Tala Talaei Khoei. 2025. Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. arXiv preprint arXiv:2501.09136 (2025). doi:10.48550/arXiv.2501.09136

  42. [42]

    Abhishek Soni and Sarah Nadi. 2019. Analyzing comment-induced updates on Stack Overflow. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR). IEEE, 220–224. doi:10.1109/MSR.2019.00040

  43. [43]

    Henry Tang and Sarah Nadi. 2021. On using Stack Overflow comment-edit pairs to recommend code maintenance changes. Empirical Software Engineering 26, 4 (2021), 68. doi:10.1007/s10664-021-09954-8

  44. [44]

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Roziere, N. Goyal, E. Hambro, F. Azhar, et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  45. [45]

    Michele Tufano, Jevgenija Pantiuchina, Cody Watson, Gabriele Bavota, and Denys Poshyvanyk. 2019. On Learning Meaningful Code Changes via Neural Machine Translation. In Proceedings of the 41st International Conference on Software Engineering (ICSE). 25–36. doi:10.1109/ICSE.2019.00021

  46. [46]

    Shaowei Wang, Tse-Hsun Chen, and Ahmed E Hassan. 2018. How do users revise answers on technical Q&A websites? A case study on Stack Overflow. IEEE Transactions on Software Engineering 46, 9 (2018), 1024–1038

  47. [47]

    Shaowei Wang, David Lo, and Lingxiao Jiang. 2013. An Empirical Study on Developer Interactions in StackOverflow. In Proceedings of the 28th Annual ACM Symposium on Applied Computing (SAC). ACM, New York, NY, USA, 1019–1024. doi:10.1145/2480362.2480557

  48. [48]

    Yue Wang, Weishi Wang, Shafiq Joty, and Steven C. H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder–Decoder Models for Code Understanding and Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 8696–8708. doi:10.18653/v1/2021.emnlp-main.685

  49. [49]

    Zora Zhiruo Wang, Yijia Shao, Omar Shaikh, Daniel Fried, Graham Neubig, and Diyi Yang. 2025. How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse Occupations. arXiv preprint arXiv:2510.22780. doi:10.48550/arXiv.2510.22780

  50. [50]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv preprint arXiv:2201.11903 (2022). https://arxiv.org/abs/2201.11903

  51. [51]

    Yuhao Wu, Shaowei Wang, Cor-Paul Bezemer, and Katsuro Inoue. 2019. How Do Developers Utilize Source Code from Stack Overflow? Empirical Software Engineering 24, 2 (2019), 637–673. doi:10.1007/s10664-018-9634-5

  52. [52]

    Xiwei Xu, Dawen Zhang, Qing Liu, Qinghua Lu, and Liming Zhu. 2025. Agentic RAG with Human-in-the-Retrieval. In Proc. of the IEEE Int. Conf. on Software Architecture Companion (ICSA-C). 498–502. doi:10.1109/ICSA-C65153.2025.00074

  53. [53]

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. 2024. SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. In Advances in Neural Information Processing Systems 37 (NeurIPS). doi:10.52202/079017-1601

  54. [54]

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Nan Du, Izhak Shafran, Karthik Narasimhan, and Denny Zhou. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv preprint arXiv:2305.10601 (2023). https://arxiv.org/abs/2305.10601

  55. [55]

    Pengcheng Yin and Graham Neubig. 2018. Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow. In International Conference on Mining Software Repositories (MSR). 476–486

  56. [56]

    Haoxiang Zhang, Shaowei Wang, Tse-Hsun (Peter) Chen, and Ahmed E. Hassan. 2021. Are Comments on Stack Overflow Well Organized for Easy Retrieval by Developers? ACM Transactions on Software Engineering and Methodology (TOSEM) 30, 2 (2021), 22:1–22:31. doi:10.1145/3434279

  57. [57]

    Haoxiang Zhang, Shaowei Wang, Tse-Hsun (Peter) Chen, and Ahmed E. Hassan. 2021. Reading Answers on Stack Overflow: Not Enough! IEEE Transactions on Software Engineering 47, 11 (2021), 2520–2533. doi:10.1109/TSE.2019.2954319

  58. [58]

    Haoxiang Zhang, Shaowei Wang, Tse-Hsun (Peter) Chen, Ying Zou, and Ahmed E. Hassan. 2021. An Empirical Study of Obsolete Answers on Stack Overflow. IEEE Transactions on Software Engineering 47, 4 (2021), 850–862. doi:10.1109/TSE.2019.2906315

  59. [59]

    Zhaoyi Zhang and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program Repair via Code Navigation and LLM Reasoning. arXiv preprint arXiv:2404.02341 (2024). https://arxiv.org/abs/2404.02341

  60. [60]

    W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 (2023)