pith. sign in

arxiv: 2410.22240 · v3 · submitted 2024-10-29 · 💻 cs.SE

Are Decoder-Only Large Language Models the Silver Bullet for Code Search?

Pith reviewed 2026-05-23 18:37 UTC · model grok-4.3

classification 💻 cs.SE
keywords code searchdecoder-only LLMsfine-tuningCodeGemmaUniXcoderCoSQA+model sizetraining data
0
0 comments X

The pith

Fine-tuned decoder-only LLMs like CodeGemma outperform encoder-only models by 40.4% MAP on CoSQA+ for code search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether decoder-only large language models can handle code search retrieval better than the encoder-only models commonly used for this task. It runs a large-scale comparison of eleven decoder-only models under zero-shot and fine-tuned conditions against baselines such as UniXcoder. Fine-tuning produces clear gains, with the strongest decoder-only model exceeding the best encoder-only result by a substantial margin on the main benchmark. Two practical patterns emerge: performance does not rise steadily with model size, and the balance of languages in the training data strongly affects cross-language results. The work supplies concrete selection rules for using these models in code search systems.

Core claim

Fine-tuned decoder-only models, particularly CodeGemma, significantly outperform encoder-only models like UniXcoder, achieving a 40.4% higher Mean Average Precision (MAP) on the CoSQA+ benchmark. The relationship between model size and performance is non-monotonic, with mid-sized models often outperforming larger variants; the composition of the training data is critical, as a multilingual dataset enhances generalization while a small amount of data from a specific language can act as noise and interfere with model effectiveness.

What carries the argument

Systematic zero-shot and fine-tuned evaluation of eleven decoder-only LLMs on code search benchmarks, with explicit controls for model size and training-data language mix.

If this is right

  • Fine-tuned decoder-only models should replace encoder-only models as the default choice for code search.
  • Mid-sized decoder-only models can be selected over larger ones without sacrificing accuracy.
  • Multilingual training data should be used to improve generalization across programming languages.
  • Small quantities of data from one language should be avoided during training to prevent interference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Code search tools could be rebuilt around fine-tuned decoder-only models to reach higher retrieval accuracy in practice.
  • Similar size and data-composition effects may appear when the same models are applied to other retrieval tasks in software engineering.
  • Optimal model choice for a given deployment may require testing a range of sizes rather than defaulting to the largest available variant.

Load-bearing premise

The chosen benchmarks and fine-tuning protocols provide a fair and generalizable measure of real-world code search performance without hidden selection effects or domain mismatch.

What would settle it

Running the same models on an independently collected code search benchmark where the MAP advantage of the fine-tuned decoder-only models disappears or reverses.

Figures

Figures reproduced from arXiv: 2410.22240 by Anji Li, Dekun Dai, Guangsheng Ou, Mingwei Liu, Yanlin Wang, Yuxuan Chen, Zibin Zheng.

Figure 1
Figure 1. Figure 1: Pooling Strategies in Encoder-Only and Decoder-Only Models evaluation metrics, based on the ranking of ground truths for each query. MRR evaluates the effectiveness of retrieving rele￾vant code snippets early in the ranking, while MAP provides a comprehensive measure of search performance by considering the precision of relevant results across the entire ranking. For the CSN dataset, which contains a singl… view at source ↗
Figure 2
Figure 2. Figure 2: Attention Mechanism A. Design Training Approach. We fine-tuned the models using the CSN dataset and supervised contrastive learning (SupCon), follow￾ing common practices [10], [68]. This method minimizes the distance between similar samples and maximizes the distance between different ones, ensuring relevant queries and code snippets are closely represented in the embedding space, while irrelevant ones are… view at source ↗
Figure 3
Figure 3. Figure 3: PCA Visualization of UniXcoder’s Embedding Space Before and After Fine-tuning on CSN. [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: PCA Visualization of CodeGemma’s Embedding Space Before and After Fine-tuning on CSN. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Similarity Histograms of Different Fine-Tuning Methods [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Histogram of Query Length 0 3 6 9 12 15 18 21 24 27 Rank 0.0 0.1 0.2 0.3 0.4 Frequency [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 10
Figure 10. Figure 10: Ultra-Short Queries Without Context To test the first reason, we conducted a small experiment by repeating queries to double their token length without changing semantics. We tested ultra-short Go language queries on CSN using Llama3 [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Fine-tuning Performance of CodeGemma and UniXcoder on CSN 0 200 400 600 800 1000 Step 0.00 0.05 0.10 0.15 0.20 0.25 MAP UniXcoder CodeGemma [PITH_FULL_IMAGE:figures/full_fig_p015_11.png] view at source ↗
read the original abstract

Code search is essential for code reuse, allowing developers to efficiently locate relevant code snippets. The advent of powerful decoder-only Large Language Models (LLMs) has revolutionized many code intelligence tasks. However, their effectiveness for the retrieval-based task of code search, particularly compared to established encoder-based models, remains underexplored. This paper addresses this gap by presenting a large-scale systematic evaluation of eleven decoder-only LLMs, analyzing their performance across zero-shot and fine-tuned settings. Our results show that fine-tuned decoder-only models, particularly CodeGemma, significantly outperform encoder-only models like UniXcoder, achieving a 40.4% higher Mean Average Precision (MAP) on the CoSQA$^+$ benchmark. Our analysis further reveals two crucial nuances for practitioners: first, the relationship between model size and performance is non-monotonic, with mid-sized models often outperforming larger variants; second, the composition of the training data is critical, as a multilingual dataset enhances generalization while a small amount of data from a specific language can act as noise and interfere with model effectiveness. These findings offer a comprehensive guide to selecting and optimizing modern LLMs for code search.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a large-scale empirical evaluation of eleven decoder-only LLMs for code search in both zero-shot and fine-tuned regimes, comparing them to encoder-only baselines such as UniXcoder. It claims that fine-tuned decoder-only models, particularly CodeGemma, deliver a 40.4% higher MAP on the CoSQA+ benchmark, while also reporting a non-monotonic relationship between model size and performance and the critical role of multilingual versus language-specific training data composition.

Significance. If the decoder-only versus encoder-only comparisons are performed under identical fine-tuning conditions, the results would provide actionable guidance on architecture choice and data curation for code retrieval tasks, potentially shifting preference away from encoder-only models in this domain. The non-monotonic size finding and data-composition nuance are practically useful observations that could be tested in follow-up work.

major comments (2)
  1. [Methods / Experimental Setup] Methods / Experimental Setup section: The manuscript must explicitly document whether the UniXcoder (and other encoder-only) baseline results were produced by re-fine-tuning the model on the identical dataset, loss, hyperparameters, and train/validation/test splits used for the decoder-only models. The central 40.4% MAP improvement claim on CoSQA+ is load-bearing on this matched-protocol detail; any reliance on previously published numbers without re-implementation would introduce a protocol confound that undermines the architecture comparison.
  2. [Results] Results section (performance tables): The paper should report the exact fine-tuning data volume and language distribution used for each model family so that readers can verify whether the multilingual advantage is driven by data scale or by the decoder architecture itself.
minor comments (2)
  1. [Abstract / Introduction] The abstract and introduction should define the precise modifications that distinguish CoSQA+ from the original CoSQA benchmark.
  2. [Figures / Tables] Figure captions and tables would benefit from explicit mention of the number of random seeds or runs underlying the reported MAP scores.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on clarifying our experimental protocols and data details. These points strengthen the manuscript's transparency. We address each major comment below.

read point-by-point responses
  1. Referee: [Methods / Experimental Setup] Methods / Experimental Setup section: The manuscript must explicitly document whether the UniXcoder (and other encoder-only) baseline results were produced by re-fine-tuning the model on the identical dataset, loss, hyperparameters, and train/validation/test splits used for the decoder-only models. The central 40.4% MAP improvement claim on CoSQA+ is load-bearing on this matched-protocol detail; any reliance on previously published numbers without re-implementation would introduce a protocol confound that undermines the architecture comparison.

    Authors: All encoder-only baselines including UniXcoder were re-fine-tuned from scratch using the exact same training dataset, loss function, hyperparameters, and train/validation/test splits as the decoder-only models. This matched protocol was followed to ensure a fair architecture comparison, but the manuscript did not explicitly state it. We will revise the Methods section to document this detail clearly, removing any potential ambiguity around the 40.4% MAP claim. revision: yes

  2. Referee: [Results] Results section (performance tables): The paper should report the exact fine-tuning data volume and language distribution used for each model family so that readers can verify whether the multilingual advantage is driven by data scale or by the decoder architecture itself.

    Authors: We agree that explicit reporting of fine-tuning data volume and language distribution per model family is necessary to isolate architecture effects from data factors. The revised manuscript will include this information, either as an expanded table or dedicated subsection in the Experimental Setup or Results, detailing token counts and language breakdowns for each fine-tuning run. revision: yes

Circularity Check

0 steps flagged

Pure empirical benchmarking; no derivations or self-referential reductions

full rationale

The paper reports direct experimental results from fine-tuning decoder-only LLMs and comparing MAP scores against published encoder-only baselines on CoSQA+. No equations, fitted parameters renamed as predictions, uniqueness theorems, or ansatzes appear. The central claim is an observed performance delta under stated protocols; it does not reduce to its own inputs by construction. Self-citations, if present, are not load-bearing for any derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities; the work is an empirical benchmark study.

pith-pipeline@v0.9.0 · 5752 in / 925 out tokens · 19220 ms · 2026-05-23T18:37:06.194809+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. XSearch: Explainable Code Search via Concept-to-Code Alignment

    cs.SE 2026-05 unverdicted novelty 6.0

    XSearch achieves explainable code search by breaking queries into functional concepts and matching them directly to code statements, delivering large gains on out-of-distribution benchmarks.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 1 Pith paper · 23 internal anchors

  1. [2]

    Accelerating onboarding,

    K. Lee, “Accelerating onboarding,” 2022

  2. [3]

    Chatgpt: Optimizing language models for dialogue,

    OpenAI, “Chatgpt: Optimizing language models for dialogue,” 2023. [Online]. Available: https://openai.com/chatgpt

  3. [4]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y . Wu, Y . Li et al. , “Deepseek-coder: When the large language model meets programming–the rise of code intelligence,” arXiv preprint arXiv:2401.14196, 2024

  4. [5]

    Retrieval-Augmented Generation for Large Language Models: A Survey

    Y . Gao, Y . Xiong, X. Gao, K. Jia, J. Pan, Y . Bi, Y . Dai, J. Sun, and H. Wang, “Retrieval-augmented generation for large language models: A survey,” arXiv preprint arXiv:2312.10997 , 2023

  5. [6]

    Program Synthesis with Large Language Models

    J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le et al. , “Program synthesis with large language models,” arXiv preprint arXiv:2108.07732 , 2021

  6. [7]

    Cocosoda: Effective contrastive learning for code search,

    E. Shi, Y . Wang, W. Gu, L. Du, H. Zhang, S. Han, D. Zhang, and H. Sun, “Cocosoda: Effective contrastive learning for code search,” in 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 2023, pp. 2198–2210

  7. [8]

    GraphCodeBERT: Pre-training Code Representations with Data Flow

    D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fu et al. , “Graphcodebert: Pre-training code repre- sentations with data flow,” arXiv preprint arXiv:2009.08366 , 2020

  8. [9]

    CodeBERT: A Pre-Trained Model for Programming and Natural Languages

    Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang et al., “Codebert: A pre-trained model for programming and natural languages,” arXiv preprint arXiv:2002.08155 , 2020

  9. [10]

    UniXcoder: Unified Cross-Modal Pre-training for Code Representation

    D. Guo, S. Lu, N. Duan, Y . Wang, M. Zhou, and J. Yin, “Unixcoder: Unified cross-modal pre-training for code representation,” arXiv preprint arXiv:2203.03850, 2022

  10. [11]

    arXiv preprint arXiv:2103.06333 (2021)

    W. U. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “Unified pre-training for program understanding and generation,” arXiv preprint arXiv:2103.06333, 2021

  11. [12]

    Evaluating Large Language Models Trained on Code

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” arXiv preprint arXiv:2107.03374 , 2021

  12. [13]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    J. W. Rae, S. Borgeaud, T. Cai, K. Millican, J. Hoffmann, F. Song, J. Aslanides, S. Henderson, R. Ring, S. Young et al., “Scaling language models: Methods, analysis & insights from training gopher,” arXiv preprint arXiv:2112.11446, 2021

  13. [14]

    Zheng, K

    Z. Zheng, K. Ning, J. Chen, Y . Wang, W. Chen, L. Guo, and W. Wang, “Towards an understanding of large language models in software engi- neering tasks,” arXiv preprint arXiv:2308.11396 , 2023

  14. [15]

    Zheng, K

    Z. Zheng, K. Ning, Y . Wang, J. Zhang, D. Zheng, M. Ye, and J. Chen, “A survey of large language models for code: Evolution, benchmarking, and future trends,” arXiv preprint arXiv:2311.10372 , 2023

  15. [16]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    X. Bi, D. Chen, G. Chen, S. Chen, D. Dai, C. Deng, H. Ding, K. Dong, Q. Du, Z. Fu et al. , “Deepseek llm: Scaling open-source language models with longtermism,” arXiv preprint arXiv:2401.02954 , 2024

  16. [17]

    Code Llama: Open Foundation Models for Code

    B. Rozi `ere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. Tan, Y . Adi, J. Liu, T. Remez, J. Rapin, A. Kozhevnikov, I. Evtimov, J. Bitton, M. P. Bhatt, C. C. Ferrer, A. Grattafiori, W. Xiong, A. D’efossez, J. Copet, F. Azhar, H. Touvron, L. Martin, N. Usunier, T. Scialom, and G. Synnaeve, “Code llama: Open foundation models for code,” ArXiv, vol. abs/2...

  17. [18]

    Decoderllms-codesearch,

    Georgepitt, “Decoderllms-codesearch,” 2024, accessed: 2024- 10-17. [Online]. Available: https://github.com/Georgepitt/ DecoderLLMs-CodeSearch

  18. [19]

    Sniff: A search engine for java using free-form queries,

    S. Chatterjee, S. Juvekar, and K. Sen, “Sniff: A search engine for java using free-form queries,” in Fundamental Approaches to Software Engineering: 12th International Conference, F ASE 2009, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2009, York, UK, March 22-29, 2009. Proceedings 12 . Springer, 2009, pp. 385–400

  19. [20]

    Deep code search,

    X. Gu, H. Zhang, and S. Kim, “Deep code search,” in Proceedings of the 40th International Conference on Software Engineering , 2018, pp. 933–944

  20. [21]

    A systematic literature review on the use of deep learning in software engineering research,

    C. Watson, N. Cooper, D. N. Palacio, K. Moran, and D. Poshyvanyk, “A systematic literature review on the use of deep learning in software engineering research,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 31, no. 2, pp. 1–58, 2022

  21. [22]

    Leveraging code generation to improve code retrieval and summarization via dual learning,

    W. Ye, R. Xie, J. Zhang, T. Hu, X. Wang, and S. Zhang, “Leveraging code generation to improve code retrieval and summarization via dual learning,” in Proceedings of The Web Conference 2020 , 2020, pp. 2309– 2319

  22. [23]

    Improving code search with co-attentive representation learning,

    J. Shuai, L. Xu, C. Liu, M. Yan, X. Xia, and Y . Lei, “Improving code search with co-attentive representation learning,” in Proceedings of the 28th International Conference on Program Comprehension , 2020, pp. 196–207

  23. [24]

    Learning code-query interaction for enhancing code searches,

    W. Li, H. Qin, S. Yan, B. Shen, and Y . Chen, “Learning code-query interaction for enhancing code searches,” in 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME) . IEEE, 2020, pp. 115–126

  24. [25]

    Adaptive deep code search,

    C. Ling, Z. Lin, Y . Zou, and B. Xie, “Adaptive deep code search,” in Proceedings of the 28th International Conference on Program Compre- hension, 2020, pp. 48–59

  25. [26]

    Multi- modal attention network learning for semantic source code retrieval,

    Y . Wan, J. Shu, Y . Sui, G. Xu, Z. Zhao, J. Wu, and P. Yu, “Multi- modal attention network learning for semantic source code retrieval,” in 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE) . IEEE, 2019, pp. 13–25

  26. [27]

    Deep graph matching and searching for semantic code retrieval,

    X. Ling, L. Wu, S. Wang, G. Pan, T. Ma, F. Xu, A. X. Liu, C. Wu, and S. Ji, “Deep graph matching and searching for semantic code retrieval,” ACM Transactions on Knowledge Discovery from Data (TKDD) , vol. 15, no. 5, pp. 1–21, 2021

  27. [28]

    Is a single model enough? mucos: A multi-model ensemble learning approach for semantic code search,

    L. Du, X. Shi, Y . Wang, E. Shi, S. Han, and D. Zhang, “Is a single model enough? mucos: A multi-model ensemble learning approach for semantic code search,” in Proceedings of the 30th ACM International Conference on Information & Knowledge Management , 2021, pp. 2994– 2998

  28. [29]

    Ocor: An overlapping-aware code retriever,

    Q. Zhu, Z. Sun, X. Liang, Y . Xiong, and L. Zhang, “Ocor: An overlapping-aware code retriever,” inProceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering , 2020, pp. 883–894

  29. [30]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018

  30. [31]

    CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

    H. Husain, H.-H. Wu, T. Gazit, M. Allamanis, and M. Brockschmidt, “Codesearchnet challenge: Evaluating the state of semantic code search,” arXiv preprint arXiv:1909.09436 , 2019

  31. [32]

    Self-supervised contrastive learning for code retrieval and summarization via semantic-preserving transforma- tions,

    N. D. Bui, Y . Yu, and L. Jiang, “Self-supervised contrastive learning for code retrieval and summarization via semantic-preserving transforma- tions,” in Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval , 2021, pp. 511– 521

  32. [33]

    On the importance of building high-quality training datasets for neural code search,

    Z. Sun, L. Li, Y . Liu, X. Du, and L. Li, “On the importance of building high-quality training datasets for neural code search,” in Proceedings of JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 18 the 44th International Conference on Software Engineering , 2022, pp. 1609–1620

  33. [34]

    CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation

    Y . Wang, W. Wang, S. Joty, and S. C. Hoi, “Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” arXiv preprint arXiv:2109.00859 , 2021

  34. [35]

    In 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023

    C. Niu, C. Li, V . Ng, D. Chen, J. Ge, and B. Luo, “An empirical comparison of pre-trained models of source code,” in 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023, Melbourne, Australia, May 14-20, 2023 . IEEE, 2023, pp. 2136–2148. [Online]. Available: https://doi.org/10.1109/ICSE48619.2023.00180

  35. [36]

    Fine-tuning llama for multi-stage text retrieval,

    X. Ma, L. Wang, N. Yang, F. Wei, and J. Lin, “Fine-tuning llama for multi-stage text retrieval,” in Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024, pp. 2421–2425

  36. [37]

    [Online]

    Llama2, 2023. [Online]. Available: https://ai.meta.com/research/ publications/llama-2-open-foundation-and-fine-tuned-chat-models/

  37. [38]

    Improving text embeddings with large language models,

    L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, and F. Wei, “Improving text embeddings with large language models,”arXiv preprint arXiv:2401.00368, 2023

  38. [39]

    Repetition improves language model embeddings,

    J. M. Springer, S. Kotha, D. Fried, G. Neubig, and A. Raghunathan, “Repetition improves language model embeddings,” arXiv preprint arXiv:2402.15449, 2024

  39. [40]

    LLM2Vec: Large language models are secretly powerful text encoders

    P. BehnamGhader, V . Adlakha, M. Mosbach, D. Bahdanau, N. Chapados, and S. Reddy, “Llm2vec: Large language models are secretly powerful text encoders,” arXiv preprint arXiv:2404.05961 , 2024

  40. [41]

    Cosqa+: En- hancing code search dataset with matching code,

    J. Gong, Y . Wu, L. Liang, Z. Zheng, and Y . Wang, “Cosqa+: En- hancing code search dataset with matching code,” arXiv preprint arXiv:2406.11589, 2024

  41. [42]

    Cosqa: 20,000+ web queries for code search and question answering,

    J. Huang, D. Tang, L. Shou, M. Gong, K. Xu, D. Jiang, M. Zhou, and N. Duan, “Cosqa: 20,000+ web queries for code search and question answering,” arXiv preprint arXiv:2105.13239 , 2021

  42. [43]

    Staqc: A systematically mined question-code dataset from stack overflow,

    Z. Yao, D. S. Weld, W.-P. Chen, and H. Sun, “Staqc: A systematically mined question-code dataset from stack overflow,” in Proceedings of the 2018 World Wide Web Conference , 2018, pp. 1693–1703

  43. [44]

    The trec-8 question answering track report

    E. M. V oorhees et al. , “The trec-8 question answering track report.” in Trec, vol. 99, 1999, pp. 77–82

  44. [45]

    Information retrieval: recent advances and beyond,

    K. A. Hambarde and H. Proenca, “Information retrieval: recent advances and beyond,” IEEE Access , 2023

  45. [46]

    Meta-llama-3-8b-instruct,

    Meta, “Meta-llama-3-8b-instruct,” https://huggingface.co/meta-llama/ Meta-Llama-3-8B-Instruct, 2024

  46. [47]

    (2024) Mistral-7b-instruct-v0.2

    Mistral. (2024) Mistral-7b-instruct-v0.2. [Online]. Available: https: //huggingface.co/mistralai/Mistral-7B-Instruct-v0.2

  47. [48]

    Deepseek-llm-7b-chat,

    DeepSeek-AI, “Deepseek-llm-7b-chat,” https://huggingface.co/ deepseek-ai/deepseek-llm-7b-chat, 2024

  48. [49]

    Gemma-7b-it,

    Google, “Gemma-7b-it,” https://huggingface.co/google/gemma-7b-it, 2024

  49. [50]

    Llama-2-7b-hf,

    Meta, “Llama-2-7b-hf,” https://huggingface.co/meta-llama/ Llama-2-7b-hf, 2024

  50. [51]

    Qwen2.5-coder-7b-instruct,

    Qwen Team, “Qwen2.5-coder-7b-instruct,” https://huggingface.co/ Qwen/Qwen2.5-Coder-7B-Instruct, 2025, accessed: 2025-07-04

  51. [52]

    Starcoder2-7b,

    BigCode Team, “Starcoder2-7b,” https://huggingface.co/bigcode/ starcoder2-7b, 2024, accessed: 2025-07-04

  52. [53]

    Mistral-7b-instruct-v0.2-code-ft-gguf,

    TheBloke, “Mistral-7b-instruct-v0.2-code-ft-gguf,” https://huggingface. co/TheBloke/Mistral-7B-Instruct-v0.2-code-ft-GGUF, 2024

  53. [54]

    Deepseek-coder-6.7b-instruct,

    DeepSeek-AI, “Deepseek-coder-6.7b-instruct,” https://huggingface.co/ deepseek-ai/deepseek-coder-6.7b-instruct, 2024

  54. [55]

    Codegemma-7b-it,

    Google, “Codegemma-7b-it,” https://huggingface.co/google/ codegemma-7b-it, 2024

  55. [56]

    Code Llama: Open Foundation Models for Code

    B. Rozi `ere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y . Adi, J. Liu, T. Remez, J. Rapin et al., “Code llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950 , 2023

  56. [57]

    Codebert-base,

    Microsoft, “Codebert-base,” https://huggingface.co/microsoft/ codebert-base, 2024

  57. [58]

    Unixcoder-base,

    ——, “Unixcoder-base,” https://huggingface.co/microsoft/ unixcoder-base, 2024

  58. [59]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al. , “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023

  59. [60]

    Mistral 7B

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier et al. , “Mistral 7b,” arXiv preprint arXiv:2310.06825 , 2023

  60. [61]

    [Online]

    Code-Mistral, 2024. [Online]. Available: https://huggingface.co/ ajibawa-2023/Code-Mistral-7B

  61. [62]

    Code-290k-sharegpt,

    Ajibawa, “Code-290k-sharegpt,” https://huggingface.co/datasets/ ajibawa-2023/Code-290k-ShareGPT, 2024

  62. [63]

    Gemma: Open Models Based on Gemini Research and Technology

    G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivi `ere, M. S. Kale, J. Love et al. , “Gemma: Open models based on gemini research and technology,” arXiv preprint arXiv:2403.08295, 2024

  63. [64]

    Codegemma: Open code models based on gemma.arXiv preprint arXiv:2406.11409, 2024

    C. T. H. Zhao, J. Hui, J. Howland, N. Nguyen, S. Zuo, A. Hu, C. A. Choquette-Choo, J. Shen, J. Kelley, K. Bansal, L. Vilnis, M. Wirth, P. Michel, P. Choy, P. Joshi, R. Kumar, S. Hashmi, S. Agrawal, Z. Gong, J. Fine, T. B. Warkentin, A. J. Hartman, B. Ni, K. Korevec, K. Schaefer, and S. Huffman, “Codegemma: Open code models based on gemma,” ArXiv, vol. abs...

  64. [65]

    The Llama 3 Herd of Models

    A. Dubey, A. Jauhri, A. Pandey, and et al., “The llama 3 herd of models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.21783

  65. [66]

    StarCoder 2 and The Stack v2: The Next Generation

    A. Lozhkov, R. Li, L. B. Allal, F. Cassano, J. Lamy-Poirier, N. Tazi, A. Tang, D. Pykhtar, J. Liu, Y . Wei et al. , “Starcoder 2 and the stack v2: The next generation,” arXiv preprint arXiv:2402.19173 , 2024

  66. [67]

    Qwen2.5-Coder Technical Report

    B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu et al. , “Qwen2. 5-coder technical report,” arXiv preprint arXiv:2409.12186, 2024

  67. [68]

    Survey of code search based on deep learning,

    Y . Xie, J. Lin, H. Dong, L. Zhang, and Z. Wu, “Survey of code search based on deep learning,” ACM Transactions on Software Engineering and Methodology , vol. 33, no. 2, pp. 1–42, 2023

  68. [69]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Y . Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V . Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692 , 2019

  69. [70]

    Codesc: A large code-description parallel dataset,

    M. Hasan, T. Muttaqueen, A. A. Ishtiaq, K. S. Mehrab, M. M. A. Haque, T. Hasan, W. U. Ahmad, A. Iqbal, and R. Shahriyar, “Codesc: A large code-description parallel dataset,” arXiv preprint arXiv:2105.14220 , 2021

  70. [71]

    MTEB Leaderboard,

    “MTEB Leaderboard,” https://huggingface.co/spaces/mteb/leaderboard

  71. [72]

    SimCSE: Simple Contrastive Learning of Sentence Embeddings

    T. Gao, X. Yao, and D. Chen, “Simcse: Simple contrastive learning of sentence embeddings,” arXiv preprint arXiv:2104.08821 , 2021

  72. [73]

    Sheared-llama-1.3b model card,

    P. NLP, “Sheared-llama-1.3b model card,” https://huggingface.co/ princeton-nlp/Sheared-LLaMA-1.3B, 2023, accessed: 2025-08-20

  73. [74]

    Llama-2-13b-hf model card,

    Meta, “Llama-2-13b-hf model card,” https://huggingface.co/meta-llama/ Llama-2-13b-hf, 2023, accessed: 2025-08-20

  74. [75]

    Qwen2.5-coder-0.5b-instruct model card,

    Q. Team, “Qwen2.5-coder-0.5b-instruct model card,” https: //huggingface.co/Qwen/Qwen2.5-Coder-0.5B-Instruct, 2024, accessed: 2025-08-20

  75. [76]

    Qwen2.5-coder-1.5b-instruct model card,

    ——, “Qwen2.5-coder-1.5b-instruct model card,” https://huggingface. co/Qwen/Qwen2.5-Coder-1.5B-Instruct, 2024, accessed: 2025-08-20

  76. [77]

    Qwen2.5-coder-3b-instruct model card,

    ——, “Qwen2.5-coder-3b-instruct model card,” https://huggingface.co/ Qwen/Qwen2.5-Coder-3B-Instruct, 2024, accessed: 2025-08-20

  77. [78]

    Qwen2.5-coder-14b-instruct model card,

    ——, “Qwen2.5-coder-14b-instruct model card,” https://huggingface.co/ Qwen/Qwen2.5-Coder-14B-Instruct, 2024, accessed: 2025-08-20

  78. [79]

    Language Models are Few-Shot Learners

    B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal et al. , “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165 , vol. 1, 2020

  79. [80]

    with Code

    P. with Code. (2024) Code generation on mbpp. Accessed: August 1, 2024. [Online]. Available: https://paperswithcode.com/ sota/code-generation-on-mbpp