pith. machine review for the scientific record.

arxiv: 2605.14503 · v1 · submitted 2026-05-14 · 💻 cs.SE

Recognition: no theorem link

Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:48 UTC · model grok-4.3

classification 💻 cs.SE
keywords RAG · Retrieval-Augmented Generation · Software Engineering · Empirical Evaluation · BM25 · Code Generation · LLM · Retrieval Models

The pith

Retriever components, especially the algorithm, often influence RAG performance for software engineering tasks more than the generator model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper dissects Retrieval-Augmented Generation systems for software engineering tasks by testing many combinations of pipeline components. It shows that how information is retrieved often affects results more than which large language model generates the final output, so practitioners can focus their effort on improving retrieval rather than endlessly swapping generators. The study covers code generation, summarization, and repair using over 21 models and methods.

Core claim

The empirical study isolates the effects of query processing, retrieval models (including BM25), context refinement, and generators across three SE tasks. It finds that the choice of retrieval algorithm frequently has a larger impact on system performance than the choice of generator model, and that the lexical retriever BM25 performs robustly across tasks.
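As a concrete picture of what "component-wise" means here, the sketch below enumerates the full 4 × 7 × 4 × 6 configuration grid over the three tasks. Only the counts and BM25 come from the abstract; every other component name is an invented placeholder, and `evaluate` is a stub standing in for a real pipeline run, not the paper's code.

```python
from itertools import product

# Placeholder component names; only the counts (4, 7, 4, 6) and BM25 are
# taken from the abstract. Everything else is illustrative.
QUERY_PROCESSORS = ["raw", "rewrite", "hyde", "decompose"]
RETRIEVERS = ["bm25", "dense_1", "dense_2", "dense_3", "dense_4", "hybrid_1", "hybrid_2"]
REFINERS = ["none", "rerank", "compress", "recomp"]
GENERATORS = [f"gen_{i}" for i in range(1, 7)]
TASKS = ["code_generation", "summarization", "repair"]

def evaluate(task: str, qp: str, retriever: str, refiner: str, generator: str) -> float:
    """Stub for one full RAG run; a real study returns the task metric here."""
    return 0.0

# One score per (task, configuration) cell of the grid.
results = {
    cfg: evaluate(*cfg)
    for cfg in product(TASKS, QUERY_PROCESSORS, RETRIEVERS, REFINERS, GENERATORS)
}
print(len(results), "task x configuration cells")  # 3 * (4*7*4*6) = 2016
```

The point of the grid is that every component axis can be varied while the others are held fixed, which is what licenses the per-component influence comparisons below.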

What carries the argument

The component-wise isolation and evaluation of RAG pipeline elements, with special focus on the retrieval algorithm's role in determining overall performance.

If this is right

  • Optimizing retrieval algorithms can provide greater performance improvements than changing the generator model.
  • BM25 serves as a reliable and effective retrieval method for various software engineering RAG applications.
  • System builders should prioritize retrieval-side enhancements when developing RAG for code-related tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Lexical retrieval like BM25 may excel in code tasks because exact matches to identifiers and syntax are critical; a sketch after this list makes the mechanism concrete.
  • These findings could extend to other retrieval-heavy domains beyond software engineering.
  • Developers might achieve better results by combining strong retrievers with simpler generators to reduce costs.
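
To make the first extension concrete, here is a minimal, self-contained BM25 scorer (Lucene-style IDF, k1 = 1.2, b = 0.75) paired with an identifier-aware tokenizer that splits snake_case and camelCase, so exact identifier overlap is rewarded. The corpus and query are invented; this is not the paper's retrieval stack.

```python
import math
import re
from collections import Counter

def tokenize(code: str) -> list[str]:
    # Pull out identifiers/words, then split snake_case and camelCase apart.
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code)
    tokens = []
    for w in words:
        for part in w.split("_"):
            tokens += re.findall(r"[A-Z]?[a-z0-9]+|[A-Z]+(?![a-z])", part)
    return [t.lower() for t in tokens if t]

def bm25_scores(query: str, docs: list[str], k1=1.2, b=0.75) -> list[float]:
    toks = [tokenize(d) for d in docs]
    avgdl = sum(len(t) for t in toks) / len(toks)
    df = Counter(term for t in toks for term in set(t))  # document frequencies
    n = len(docs)
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for q in tokenize(query):
            if q not in tf:
                continue
            idf = math.log((n - df[q] + 0.5) / (df[q] + 0.5) + 1)
            s += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

corpus = [  # invented code snippets
    "def parse_config(path): return json.load(open(path))",
    "def load_model(checkpoint_path): ...",
    "def render_template(template, ctx): ...",
]
print(bm25_scores("how to parse a config file", corpus))
```

Because the tokenizer decomposes `parse_config` into `parse` and `config`, the query's exact lexical overlap with the first snippet dominates the scores, which is the hypothesized advantage over embeddings that may blur such distinctions.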

Load-bearing premise

That the results from the three specific SE tasks and chosen models and datasets will hold for other software engineering problems and real-world codebases.

What would settle it

Running the same component comparisons on additional SE tasks such as code summarization for larger projects or different programming languages and checking if the retriever still dominates performance.

Figures

Figures reproduced from arXiv: 2605.14503 by Haoyu Wang, Hongjin Leng, Qiang Ke, Shengming Zhao, Yanjie Zhao.

Figure 1. The overall architecture of our RAG framework.
Figure 2. The Extractive-Abstractive Pipeline of the Zero-shot Recomp Adaptation.
Figure 3. The Decision Logic of the LLM-Driven Adaptive RAG Framework.
Figure 4. Performance of context compression methods across five evaluation scenarios. The charts illustrate …
Figure 5. Performance comparison of different generators across all tasks. RAG performance is shown for …
Figure 6. Impact of the number of retrieved documents (…
Figure 7. Cross-temporal validation of core RAG findings. The relative performance trends across (a) Query …
original abstract

While Retrieval-Augmented Generation (RAG) is increasingly adopted to ground Large Language Models (LLMs) in software artifacts, the optimal configuration of its components remains an open question for software engineering (SE) tasks. The lack of systematic guidance forces practitioners into costly, ad-hoc experimentation. This paper presents a comprehensive, component-wise empirical study that dissects the RAG pipeline, evaluating over 21 distinct models and methods. Our study systematically isolates and evaluates 4 query processing techniques, 7 retrieval models spanning sparse, dense, and hybrid paradigms, 4 context refinement methods, and 6 distinct generators. We test these components on a suite of 3 core SE tasks: code generation, summarization, and repair. Our empirical findings reveal a crucial insight: the retriever-side components, particularly the choice of the retrieval algorithm, often exert a more significant influence on final system performance than the selection of the generator model. Strikingly, the classic lexical retriever BM25 demonstrates exceptionally robust performance across diverse tasks. Our analysis provides a practical, data-driven roadmap for researchers and practitioners, offering clear guidance on prioritizing optimization efforts when constructing effective RAG systems for software engineering contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a component-wise empirical study of Retrieval-Augmented Generation (RAG) pipelines for three software engineering tasks (code generation, summarization, and repair). It isolates and evaluates 4 query processing techniques, 7 retrieval models (sparse, dense, and hybrid), 4 context refinement methods, and 6 generators, reporting that retriever-side components—particularly the choice of retrieval algorithm—exert greater influence on final performance than generator selection, with the classic BM25 retriever showing robust results across tasks.

Significance. If the comparative influence findings hold under controlled analysis, the work supplies actionable, data-driven guidance for SE practitioners constructing RAG systems and highlights that retrieval choices may warrant higher priority than generator upgrades. The emphasis on BM25's consistent performance offers a concrete, low-cost baseline that could reduce reliance on expensive neural retrievers in code-related applications.

major comments (2)
  1. [Results and Analysis] The central claim that retriever-side components exert more influence than generator selection requires matched ablations: performance deltas (or ranges) across the 7 retrieval models for each fixed generator must be directly compared against deltas across the 6 generators for each fixed retriever (e.g., via max or average spread, or factorial ANOVA). The abstract and results presentation do not report such effect-size comparisons, so the 'more significant' assertion rests on an unverified assumption about comparable magnitudes. A sketch of this spread comparison follows the minor comments.
  2. [Experimental Setup] No details are supplied on statistical testing, variance or standard deviation across runs, confidence intervals, or exact dataset sizes and splits. This absence makes it impossible to determine whether observed component rankings and task differences are reliable or could be artifacts of single-run noise or metric scaling.
minor comments (1)
  1. [Abstract] The abstract states 'over 21 distinct models and methods' while the component counts sum exactly to 21; confirm that the full text consistently reports the total number of unique RAG configurations actually evaluated rather than the sum of component options.
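
For major comment 1, the requested comparison reduces to two numbers per task: the average score range across retrievers at each fixed generator, versus the average range across generators at each fixed retriever. A hedged sketch, with invented scores shaped like the paper's claim rather than its actual results:

```python
from statistics import mean

def spread(values):
    return max(values) - min(values)

def component_spreads(scores):
    """scores[(retriever, generator)] -> one task's benchmark metric."""
    retrievers = sorted({r for r, _ in scores})
    generators = sorted({g for _, g in scores})
    # Range across retrievers, holding each generator fixed.
    retriever_effect = mean(spread([scores[(r, g)] for r in retrievers]) for g in generators)
    # Range across generators, holding each retriever fixed.
    generator_effect = mean(spread([scores[(r, g)] for g in generators]) for r in retrievers)
    return retriever_effect, generator_effect

scores = {  # toy numbers, not the paper's data
    ("bm25", "gen_a"): 0.42, ("bm25", "gen_b"): 0.45,
    ("dense", "gen_a"): 0.30, ("dense", "gen_b"): 0.33,
    ("hybrid", "gen_a"): 0.38, ("hybrid", "gen_b"): 0.40,
}
r_eff, g_eff = component_spreads(scores)
print(f"retriever spread {r_eff:.3f} vs generator spread {g_eff:.3f}")
```

If the retriever-side number dominates across tasks, the 'more significant' wording rests on explicit effect sizes rather than on eyeballing per-component tables.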

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to improve clarity and rigor.

point-by-point responses
  1. Referee: [Results and Analysis] The central claim that retriever-side components exert more influence than generator selection requires matched ablations: performance deltas (or ranges) across the 7 retrieval models for each fixed generator must be directly compared against deltas across the 6 generators for each fixed retriever (e.g., via max or average spread, or factorial ANOVA). The abstract and results presentation do not report such effect-size comparisons, so the 'more significant' assertion rests on an unverified assumption about comparable magnitudes.

    Authors: We agree that quantifying the relative influence through matched effect-size comparisons would strengthen the central claim. In the revised manuscript, we will add a dedicated analysis computing performance deltas (max-min ranges and average spreads) across the 7 retrieval models for each fixed generator, and directly compare these to the deltas across the 6 generators for each fixed retriever. These results will be presented in an additional table or figure in the results section, with discussion of the magnitudes. We will also update the abstract to reference the quantified comparison. This revision directly addresses the concern. revision: yes

  2. Referee: [Experimental Setup] No details are supplied on statistical testing, variance or standard deviation across runs, confidence intervals, or exact dataset sizes and splits. This absence makes it impossible to determine whether observed component rankings and task differences are reliable or could be artifacts of single-run noise or metric scaling.

    Authors: We acknowledge that these experimental details were omitted. In the revision, we will explicitly report the exact dataset sizes and splits used for each of the three tasks (code generation, summarization, and repair). However, all experiments were conducted as single runs per configuration to manage the substantial computational cost of the full combinatorial evaluation. As a result, we do not have variance, standard deviations, or confidence intervals from multiple runs, and cannot add statistical testing without new experiments. We will state this limitation clearly in the experimental setup section and discuss its implications for interpreting the rankings. revision: partial

standing simulated objections (not resolved)
  • The absence of variance, standard deviations, confidence intervals, and formal statistical testing, which cannot be added without re-running the full set of experiments with multiple random seeds; a sketch of such a re-run follows.
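
What that re-run would involve is mechanical: repeat each configuration under several seeds and report dispersion. A minimal sketch, with a fake `evaluate` standing in for a seeded pipeline run; the numbers are invented.

```python
import random
from statistics import mean, stdev

def evaluate(config, seed: int) -> float:
    """Stand-in for one full RAG run; real code would seed the whole stack."""
    rng = random.Random(hash(config) ^ seed)
    return 0.40 + rng.gauss(0, 0.01)  # fake metric with run-to-run noise

def summarize(config, seeds=range(5)):
    runs = [evaluate(config, s) for s in seeds]
    m, sd = mean(runs), stdev(runs)
    half = 1.96 * sd / len(runs) ** 0.5  # 95% CI half-width, normal approx.
    return m, sd, (m - half, m + half)

print(summarize(("bm25", "gen_a")))
```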

Circularity Check

0 steps flagged

No circularity: purely empirical component comparison

full rationale

The paper performs a direct empirical ablation across 4 query processors, 7 retrievers, 4 refiners, and 6 generators on three fixed SE tasks using standard metrics. No equations, fitted parameters, or predictions appear; all reported influences are measured performance deltas on external datasets. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling exist. The retriever-vs-generator claim rests on observed spreads rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study rests on standard benchmarking assumptions about task representativeness and metric validity; no free parameters are fitted to produce the central claim and no new entities are postulated.

axioms (1)
  • domain assumption The three selected SE tasks and the chosen models/datasets are sufficiently representative to support general recommendations about component importance.
    Invoked when generalizing the observed retriever dominance to broader RAG practice in software engineering.

pith-pipeline@v0.9.0 · 5521 in / 1152 out tokens · 59265 ms · 2026-05-15T01:48:57.380310+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · 16 internal anchors

  1. [1]

    Marah Abdin, Jyoti Aneja, Hany Awadalla, et al. 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219 [cs.CL]

  2. [2]

    Uri Alon, Meital Zilberstein, Omer Levy, and Eran Yahav. 2019. code2vec: Learning Distributed Representations of Code. Proc. ACM Program. Lang. 3, POPL (2019), 40:1–40:29. doi:10.1145/3290353

  3. [3]

    Muhammad Arslan, Hussam Ghanem, Saba Munawar, and Christophe Cruz. 2024. A Survey on RAG with LLMs. Procedia Computer Science 246 (2024), 3781–3790. doi:10.1016/j.procs.2024.09.178

  4. [4]

    Mihir Athale and Vishal Vaddina. 2025. Knowledge Graph Based Repository-Level Code Generation. In 2025 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code). IEEE, Piscataway, NJ, USA, 169–176. doi:10.1109/llm4code66737.2025.00026

  5. [5]

    Abhiram Bellur, Fraol Batole, Mohammed Raihan Ullah, Malinda Dilhara, Yaroslav Zharov, Timofey Bryksin, Kai Ishikawa, Haifeng Chen, Masaharu Morimoto, Takeo Hosomi, Tien N. Nguyen, Hridesh Rajan, Nikolaos Tsantalis, and Danny Dig. 2025. Together We are Better: LLM, IDE and Semantic Embedding to Assist Move Method Refactoring. In Proceedings of the 41st IE...

  6. [6]

    Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2402.03216 [cs.CL]

  7. [7]

    Mark Chen, Jerry Tworek, Heewoo Jun, et al. 2021. Evaluating Large Language Models Trained on Code. arXiv:2107.03374 [cs.LG]

  8. [8]

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, et al. 2020. Unsupervised Cross-lingual Representation Learning at Scale. arXiv:1911.02116 [cs.CL]

  9. [9]

    Gordon V. Cormack, Charles L. A. Clarke, and Stefan Büttcher. 2009. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, James Allan, Javed A. Aslam, Mark Sanderson, ChengXiang Zhai, and Justin Zobel (Eds.). ACM,...

  10. [10]

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, et al. 2025. DeepSeek-V3 Technical Report. arXiv:2412.19437 [cs.CL]

  11. [11]

    Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. 2024. A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, Barcelona, Spain, 6491–6501. doi:10.1145/3637528.3671470

  12. [12]

    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, et al. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Stroudsburg, PA, USA, 1536–1547

  13. [13]

    Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. 2023. Precise Zero-Shot Dense Retrieval without Relevance Labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, Anna Rogers, Jordan L. Boyd-Graber, and Naoaki Okazaki (Eds.). Association for C...

  14. [14]

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2024. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997 [cs.CL]

  15. [15]

    Luca Gazzola, Daniela Micucci, and Leonardo Mariani. 2019. Automatic Software Repair: A Survey. IEEE Transactions on Software Engineering 45, 1 (2019), 34–67. doi:10.1109/TSE.2017.2755013

  16. [16]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. 2024. The Llama 3 Herd of Models. arXiv:2407.21783 [cs.AI]

  17. [17]

    Wenchao Gu, Juntao Chen, Yanlin Wang, Tianyue Jiang, Xingzhe Li, Mingwei Liu, Xilin Liu, Yuchi Ma, and Zibin Zheng. 2025. What to Retrieve for Effective Retrieval-Augmented Code Generation? An Empirical Study and Beyond. arXiv:2503.20589 [cs.SE]

  18. [18]

    Michael Günther, Jackmin Ong, Isabelle Mohr, Alaeddine Abdessalem, Tanguy Abel, Mohammad Kalim Akram, Susana Guzman, Georgios Mastrapas, Saba Sturua, Bo Wang, Maximilian Werk, Nan Wang, and Han Xiao. 2024. Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents. arXiv:2310.19923 [cs.CL]

  19. [19]

    Dan Hendrycks, Steven Basart, Saurav Kadavath, et al. 2021. Measuring Coding Challenge Competence With APPS. arXiv:2105.09938 [cs.SE]

  20. [20]–[21]

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring Mathematical Problem Solving With the MATH Dataset. arXiv:2103.03874 [cs.LG]

  22. [22]

    Gautier Izacard and Edouard Grave. 2021. Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering. arXiv:2007.01282 [cs.CL]

  23. [23]

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. 2025. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code. In The Thirteenth International Conference on Learning Representations, ICLR 2025. OpenReview.net, Amherst, MA, USA, 1–15

  24. [24]

    Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. 2024. LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, ...

  25. [25]

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2021. Billion-Scale Similarity Search with GPUs. IEEE Transactions on Big Data 7, 3 (2021), 535–547. doi:10.1109/TBDATA.2019.2921572

  26. [26]

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, et al. 2020. Dense Passage Retrieval for Open-Domain Question Answering. arXiv:2004.04906 [cs.CL]

  27. [27]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401 [cs.CL]

  28. [28]

    Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. 2023. Towards General Text Embeddings with Multi-stage Contrastive Learning. arXiv:2308.03281 [cs.CL]

  29. [29]

    Ye Liu, Rui Meng, Shafiq Joty, Silvio Savarese, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. 2025. CodeXEmbed: A Generalist Embedding Model Family for Multilingual and Multi-task Code Retrieval. arXiv:2411.12644 [cs.CL]

  30. [30]

    Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, et al. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation. arXiv:2102.04664 [cs.SE]

  31. [31]

    Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. 2023. Query Rewriting for Retrieval-Augmented Large Language Models. arXiv:2305.14283 [cs.CL]

  32. [32]

    Mayank Mishra, Matt Stallone, Gaoyuan Zhang, et al. 2024. Granite Code Models: A Family of Open Foundation Models for Code Intelligence. arXiv:2405.04324 [cs.AI]

  33. [33]

    Rodrigo Nogueira and Kyunghyun Cho. 2020. Passage Re-ranking with BERT. arXiv:1901.04085 [cs.IR]

  34. [34]

    OpenAI. 2024. GPT-4o System Card. arXiv:2410.21276 [cs.CL]

  35. [35]

    Md Rizwan Parvez, Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Retrieval Augmented Code Generation and Summarization. arXiv:2108.11601 [cs.SE]

  36. [36]

    Shuo Ren, Daya Guo, Shuai Lu, et al. 2020. CodeBLEU: a Method for Automatic Evaluation of Code Synthesis. arXiv:2009.10297 [cs.SE]

  37. [37]

    S. E. Robertson and S. Walker. 1994. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR '94: Proceedings of the 17th annual international ACM SIGIR conference on Research and development in information retrieval. Springer-Verlag, London, UK, 232–241

  38. [38]

    Jiho Shin, Reem Aleithan, Hadi Hemmati, and Song Wang. 2024. Retrieval-Augmented Test Generation: How Far Are We? arXiv:2409.12682 [cs.SE]

  39. [39]

    Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, and Alexey Svyatkovskiy. 2022. ReACC: A Retrieval-Augmented Code Completion Framework. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Dublin, Ireland, 6227–6240. doi:10.18653/v1/2022.acl...

  40. [40]

    Jonathan Sillito, Frank Maurer, Seyed Mehdi Nasehi, and Chris Burns. 2012. What makes a good code example? A study of programming Q and A in StackOverflow. In Proceedings of the 2012 IEEE International Conference on Software Maintenance (ICSM). IEEE, Piscataway, NJ, USA, 25–34. doi:10.1109/ICSM.2012.6405249

  41. [41]

    Karen Spärck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 28, 1 (1972), 11–21

  42. [42]

    Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, et al. 2023. Is ChatGPT Good at Search? Investigating Large Language Models as Re-Ranking Agents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Associat...

  43. [43]

    Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Haotian Hui, Weichuan Liu, Zhiyuan Liu, and Maosong Sun. 2024. DebugBench: Evaluating Debugging Capability of Large Language Models. arXiv:2401.04621 [cs.SE]

  44. [44]

    Christoph Treude, Ohad Barzilay, and Margaret-Anne Storey. 2011. How do programmers ask and answer questions on the web? (NIER track). In Proceedings of the 33rd International Conference on Software Engineering. ACM, New York, NY, USA, 804–807. doi:10.1145/1985793.1985907

  45. [45]

    Chaozheng Wang, Yuanhang Yang, Cuiyun Gao, Yun Peng, Hongyu Zhang, and Michael R. Lyu. 2022. No more fine-tuning? an experimental evaluation of prompt tuning in code intelligence. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '22). ACM, New York, NY, USA, 382–...

  46. [46]

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, et al. 2024. Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv:2212.03533 [cs.CL]

  47. [47]

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024. Improving Text Embeddings with Large Language Models. arXiv:2401.00368 [cs.CL]

  48. [48]

    Xiaohua Wang, Zhenghua Wang, Xuan Gao, Feiran Zhang, Yixin Wu, Zhibo Xu, Tianyuan Shi, Zhengyuan Wang, Shizheng Li, Qi Qian, Ruicheng Yin, Changze Lv, Xiaoqing Zheng, and Xuanjing Huang. 2024. Searching for Best Practices in Retrieval-Augmented Generation. arXiv:2407.01219 [cs.CL]

  49. [49]

    Yuan Wang, Xuyang Wu, Hsin-Tai Wu, Zhiqiang Tao, and Yi Fang. 2024. Do Large Language Models Rank Fairly? An Empirical Study on the Fairness of LLMs as Rankers. arXiv:2404.03192 [cs.IR]

  50. [50]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh (Eds.), Vol. 35. Curran Associates, Inc., Red Hook, NY, USA,...

  51. [51]

    Fangyuan Xu, Weijia Shi, and Eunsol Choi. 2023. RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation. arXiv:2310.04408 [cs.CL]

  52. [52]

    An Yang, Baosong Yang, Beichen Zhang, et al. 2025. Qwen2.5 Technical Report. arXiv:2412.15115 [cs.CL]

  53. [53]

    Zezhou Yang, Sirong Chen, Cuiyun Gao, Zhenhao Li, Xing Hu, Kui Liu, and Xin Xia. 2025. An Empirical Study of Retrieval-Augmented Code Generation: Challenges and Opportunities. ACM Trans. Softw. Eng. Methodol. 34, 7 (2025), 188:1–188:28. doi:10.1145/3717061

  54. [54]

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. 2025. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. arXiv:2506.05176 [cs.CL]

  55. [55]

    Shuyan Zhou, Uri Alon, Frank F. Xu, Zhiruo Wang, Zhengbao Jiang, and Graham Neubig. 2023. DocPrompting: Generating Code by Retrieving the Docs. arXiv:2207.05987 [cs.SE]

  56. [56]

    Xiaoling Zhou, Ou Wu, Weiyao Zhu, and Ziyang Liang. 2022. Understanding Difficulty-Based Sample Weighting with a Universal Difficulty Measure. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2022, Grenoble, France, September 19–23, 2022, Proceedings, Part III. Springer-Verlag, Cham, Switzerland, 68–84. doi:10.1007/9...