UNICS: Multilingual Code Search via Unified Pseudocode and Contrastive Transfer Learning

Bin Luo; Chuanyi Li; Jidong Ge; Liguo Huang; Ye Fan

arxiv: 2606.27747 · v1 · pith:SV6EKE7Rnew · submitted 2026-06-26 · 💻 cs.SE

UNICS: Multilingual Code Search via Unified Pseudocode and Contrastive Transfer Learning

Ye Fan , Jidong Ge , Chuanyi Li , Liguo Huang , Bin Luo This is my paper

Pith reviewed 2026-06-29 04:06 UTC · model grok-4.3

classification 💻 cs.SE

keywords multilingual code searchpseudocode representationcontrastive learningtransfer learningzero-shot transfercode semantic sliceshard negative sampling

0 comments

The pith

UNICS pre-trains on pseudocode to learn cross-lingual code logic then transfers via multi-task contrastive learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces UNICS to tackle data imbalance, semantic interference, and information loss in multilingual code search. It first pre-trains a model on a new pseudocode dataset to capture algorithm-level logic that stays consistent across languages while keeping full semantics intact. A second stage then adapts this knowledge to individual languages through multi-task learning that breaks code into semantic slices and uses hard positive mining plus dynamic hard negative sampling. If the approach works, code search tools could perform well even for low-resource languages without needing massive language-specific datasets. Sympathetic readers would care because better cross-language code search directly helps developers navigate mixed-language codebases more effectively.

Core claim

UNICS pre-trains on a constructed pseudocode dataset to learn a cross-lingual, algorithm-level logic that preserves full semantic fidelity, then applies a multi-task transfer learning strategy that decomposes code into semantic slices and adds tasks for hard positive mining and cross-lingual dynamic hard negative sampling to adapt the general knowledge to specific languages, producing state-of-the-art results on multilingual and cross-lingual benchmarks with particular strength in zero-shot transfer to low-resource languages.

What carries the argument

Two-stage training that first uses pseudocode as a unified representation for cross-lingual pre-training and then applies multi-task contrastive transfer with semantic slices and hard mining.

Load-bearing premise

The pseudocode dataset keeps every important semantic detail from the original code and the multi-task transfer step moves knowledge to new languages without creating interference between them.

What would settle it

A new low-resource language benchmark where UNICS scores below current baselines, or a controlled test showing that the pseudocode translation step drops critical semantic information.

Figures

Figures reproduced from arXiv: 2606.27747 by Bin Luo, Chuanyi Li, Jidong Ge, Liguo Huang, Ye Fan.

**Figure 2.** Figure 2: The prompt used to guide the LLM in generating our Universal Code, containing detailed specifications [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Examples of four semantic code slices. 3.2.1 Semantic Code Slicing. Traditional methods encode the entire function body as a single unit, which can lead the model to over-rely on surface-level features like function names, while neglecting the internal implementation logic, thereby harming generalization. To address this issue, we decompose source code functions into four types of slices, each carrying dif… view at source ↗

**Figure 4.** Figure 4: Visualization of Performance and Language Discrimination in UNICS. [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗

read the original abstract

While pre-trained models have achieved remarkable success in code search, their multilingual capabilities remain a major hurdle, plagued by data imbalance, cross-lingual semantic interference, and the loss of critical information from existing unified representations like Abstract Syntax Trees (ASTs) or Intermediate Representations (IRs). Furthermore, conventional contrastive learning strategies often rely on simplistic hard negative sampling while overlooking the potential of mining hard positives to learn code's intrinsic semantic invariance. To address these challenges, we introduce UNICS, a framework for multilingual code search built on a two-stage training strategy. In the first stage, UNICS is pre-trained on a novel dataset we constructed, which uses pseudo-code as a unified representation to learn a cross-lingual, algorithm-level logic that preserves full semantic fidelity. The second stage employs a multi-task transfer learning strategy that adapts this general knowledge to specific languages by decomposing code into semantic slices (e.g., API calls, function bodies) and incorporating tasks for hard positive mining and cross-lingual dynamic hard negative sampling. Experimental results demonstrate that UNICS achieves state-of-the-art performance across multiple multilingual and cross-lingual benchmarks, showcasing superior generalization and performance balance, especially in zero-shot transfer tasks to low-resource languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

UNICS's pseudocode pre-training lacks any described validation or construction details, which leaves the SOTA and zero-shot claims unevaluable.

read the letter

The paper presents UNICS as a two-stage system for multilingual code search. The first stage pre-trains on a new pseudocode dataset meant to capture algorithm-level logic that transfers across languages. The second stage adapts that knowledge through multi-task learning on semantic slices, plus hard-positive mining and cross-lingual dynamic negative sampling.

The combination of pseudocode as the bridge representation with those specific contrastive choices is the main thing offered as new. The write-up does a reasonable job naming the practical problems—data imbalance, interference from unified AST or IR representations, and weak negative sampling—that current models face.

The soft spot is the one the stress-test note flags. The abstract states that the pseudocode dataset preserves full semantic fidelity, yet supplies no construction method, no equivalence metric, and no human or automated check. Without that, it is impossible to know whether control-flow or API details survive the translation step. If they do not, the later adaptation stage cannot recover them, and both the low-resource zero-shot results and the overall SOTA claim rest on an untested assumption. The abstract also asserts superior performance but contains no numbers, baselines, dataset sizes, or statistical tests, so the empirical side cannot be assessed from the given text.

This is aimed at researchers working on code search and multilingual representations inside software engineering. A reader already active in that area might pick up the training recipe to try, but only after the full paper supplies the missing dataset details and the actual experimental numbers.

The work shows clear engagement with the stated problems in the literature, so it deserves a serious referee to examine the experiments and the pseudocode construction once they are provided.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes UNICS, a two-stage framework for multilingual code search. The first stage pre-trains a model on a novel pseudocode dataset constructed to serve as a unified cross-lingual representation that preserves full semantic fidelity and captures algorithm-level logic. The second stage applies multi-task transfer learning that decomposes code into semantic slices, incorporates hard-positive mining, and uses cross-lingual dynamic hard-negative sampling. The central claim is that this yields state-of-the-art results on multiple multilingual and cross-lingual code-search benchmarks, with particular gains in zero-shot transfer to low-resource languages.

Significance. If the empirical results are reproducible and the pseudocode representation is shown to preserve semantics, the work would offer a concrete route to mitigating data imbalance and cross-lingual interference in code search, with direct relevance to improving retrieval for low-resource languages.

major comments (3)

[Section 3.1] Section 3.1 (pseudocode dataset construction): no equivalence metric, human validation protocol, or automated semantic-fidelity check is described; the claim that the dataset 'preserves full semantic fidelity' is therefore unsupported, which is load-bearing for both the cross-lingual transfer argument and the zero-shot low-resource results.
[Section 5] Section 5 (experimental results): the abstract and main text assert SOTA performance yet supply no concrete metrics, baseline names, dataset cardinalities, or statistical significance tests; the central empirical claim cannot be evaluated from the provided evidence.
[Section 4.2] Section 4.2 (multi-task transfer): the decomposition into semantic slices and the contrastive objectives are introduced without ablation studies that isolate the contribution of hard-positive mining versus dynamic negative sampling, preventing attribution of reported gains to the proposed components.

minor comments (2)

[Abstract] Abstract: the specific benchmark names and language pairs used for the multilingual and cross-lingual evaluations are not listed.
Notation: the definition of 'semantic slices' (API calls, function bodies, etc.) would benefit from a short illustrative example in the main text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the paper.

read point-by-point responses

Referee: [Section 3.1] Section 3.1 (pseudocode dataset construction): no equivalence metric, human validation protocol, or automated semantic-fidelity check is described; the claim that the dataset 'preserves full semantic fidelity' is therefore unsupported, which is load-bearing for both the cross-lingual transfer argument and the zero-shot low-resource results.

Authors: We agree that Section 3.1 does not currently describe an equivalence metric, human validation protocol, or automated semantic-fidelity check, leaving the 'full semantic fidelity' claim unsupported. In the revised manuscript we will expand Section 3.1 to include: (i) an automated equivalence metric based on embedding cosine similarity between original code and pseudocode, (ii) a human validation protocol on a random sample of 500 pairs with reported inter-annotator agreement, and (iii) the results of that validation. These additions will directly support the cross-lingual and zero-shot arguments. revision: yes
Referee: [Section 5] Section 5 (experimental results): the abstract and main text assert SOTA performance yet supply no concrete metrics, baseline names, dataset cardinalities, or statistical significance tests; the central empirical claim cannot be evaluated from the provided evidence.

Authors: We acknowledge that the abstract and introductory paragraphs of Section 5 do not list concrete metrics, baseline names, dataset cardinalities, or significance tests, making the SOTA claim difficult to evaluate from those passages alone. We will revise the abstract and the opening of Section 5 to explicitly report key metrics (e.g., MRR on each benchmark), name all baselines, state dataset sizes, and include statistical significance results. The detailed tables already present in Section 5 will remain and be cross-referenced. revision: yes
Referee: [Section 4.2] Section 4.2 (multi-task transfer): the decomposition into semantic slices and the contrastive objectives are introduced without ablation studies that isolate the contribution of hard-positive mining versus dynamic negative sampling, preventing attribution of reported gains to the proposed components.

Authors: We agree that the current version of Section 4.2 introduces the multi-task components without ablation studies isolating hard-positive mining from dynamic hard-negative sampling. In the revision we will add a dedicated ablation subsection (or appendix) that reports performance when each component is removed individually and when both are used together, thereby allowing attribution of gains to the proposed techniques. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on external benchmarks and experimental outcomes

full rationale

The paper describes a two-stage training framework evaluated on multilingual code search benchmarks. No equations, fitted parameters, or derivations are present that reduce to inputs by construction. The pseudocode dataset and transfer strategy are presented as inputs to the method, with performance claims tied to external test results rather than self-referential definitions or self-citation chains. This is a standard experimental ML paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unverified assertion that the constructed pseudocode dataset and the multi-task strategy preserve semantics.

pith-pipeline@v0.9.1-grok · 5752 in / 1103 out tokens · 59122 ms · 2026-06-29T04:06:32.896619+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

64 extracted references · 54 canonical work pages · 7 internal anchors

[1]

Github Website

2023. Github Website. https://www.github.com

2023
[2]

Code Embedding

2026. Code Embedding. https://bitbucket.org/anonymous_code/code_embedding

2026
[3]

Shushan Arakelyan, Anna Hakhverdyan, Miltiadis Allamanis, Luis Garcia, Christophe Hauser, and Xiang Ren. 2022. NS3: Neuro-symbolic Semantic Code Search. (2022). doi:10.52202/068431-0761

work page doi:10.52202/068431-0761 2022
[4]

Joel Brandt, Mira Dontcheva, Marcos Weskamp, and Scott R. Klemmer. 2010. Example-centric programming: integrating web search into the development environment. InProceedings of the 28th International Conference on Human Factors in Computing Systems, CHI 2010, Atlanta, Georgia, USA, April 10-15, 2010, Elizabeth D. Mynatt, Don Schoner, Geraldine Fitzpatrick,...

work page doi:10.1145/1753326.1753402 2010
[5]

Guo, Joel Lewenstein, Mira Dontcheva, and Scott R

Joel Brandt, Philip J. Guo, Joel Lewenstein, Mira Dontcheva, and Scott R. Klemmer. 2009. Two studies of opportunistic programming: interleaving web foraging, learning, and writing code. InProceedings of the 27th International Conference on Human Factors in Computing Systems, CHI 2009, Boston, MA, USA, April 4-9, 2009, Dan R. Olsen Jr., Richard B. Arthur, ...

arXiv 2009
[6]

Nghi D. Q. Bui, Yijun Yu, and Lingxiao Jiang. 2021. Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations. InSIGIR ’21: The 44th International ACM SIGIR Conference Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE005. Publication date: July 2026. UNICS: Multilingual Code Search via Unified Pseu...

work page doi:10.1145/3404835.3462840 2021
[7]

Róbert Busa-Fekete, György Szarvas, Tamás Elteto, and Balázs Kégl. 2012. An apple-to-apple comparison of learning- to-rank algorithms in terms of normalized discounted cumulative gain. InECAI 2012-20th European Conference on Artificial Intelligence: Preference Learning: Problems and Applications in AI Workshop, Vol. 242. Ios Press

2012
[8]

José Cambronero, Hongyu Li, Seohyun Kim, Koushik Sen, and Satish Chandra. 2019. When deep learning met code search. InProceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2019, Tallinn, Estonia, August 26-30, 2019, Marlon Dumas, Dietmar Pfahl, Sven Apel,...

work page doi:10.1145/3338906.3340458 2019
[9]

Brock Angus Campbell and Christoph Treude. 2017. NLP2Code: Code Snippet Content Assist via Natural Language Tasks. In2017 IEEE International Conference on Software Maintenance and Evolution, ICSME 2017, Shanghai, China, September 17-22, 2017. IEEE Computer Society, 628–632. doi:10.1109/ICSME.2017.56

work page doi:10.1109/icsme.2017.56 2017
[10]

Yitian Chai, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. 2022. Cross-Domain Deep Code Search with Meta Learning. In44th IEEE/ACM 44th International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022. ACM, 487–498. doi:10.1145/3510003.3510125

work page doi:10.1145/3510003.3510125 2022
[11]

Wing-Kwan Chan, Hong Cheng, and David Lo. 2012. Searching connected API subgraph via text phrases. In20th ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE-20), SIGSOFT/FSE’12, Cary, NC, USA - November 11 - 16, 2012, Will Tracz, Martin P. Robillard, and Tevfik Bultan (Eds.). ACM, 10. doi:10.1145/2393596.2393606

work page doi:10.1145/2393596.2393606 2012
[12]

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. Bge m3-embedding: Multi- lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.arXiv preprint arXiv:2402.03216(2024). doi:10.48550/arXiv.2402.03216

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.03216 2024
[13]

Guanyi Chu, Xiao Wang, Chuan Shi, and Xunqiang Jiang. 2021. CuCo: Graph Representation with Curriculum Contrastive Learning.. InIJCAI. 2300–2306. doi:10.24963/ijcai.2021/317

work page doi:10.24963/ijcai.2021/317 2021
[14]

Samip Dahal, Adyasha Maharana, and Mohit Bansal. 2022. Scotch: A Semantic Code Search Engine for IDEs. InDeep Learning for Code Workshop. https://openreview.net/forum?id=rSxfCiOZk-c

2022
[15]

Jingtao Ding, Yuhan Quan, Quanming Yao, Yong Li, and Depeng Jin. 2020. Simplify and robustify negative sampling for implicit collaborative filtering.Advances in Neural Information Processing Systems33 (2020), 1094–1105. doi:10. 48550/arXiv.2009.03376

arXiv 2020
[16]

Hongchao Fang and Pengtao Xie. 2020. CERT: Contrastive Self-supervised Learning for Language Understanding. CoRRabs/2005.12766 (2020). arXiv:2005.12766 doi:10.48550/arXiv.2005.12766

work page doi:10.48550/arxiv.2005.12766 2020
[17]

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020 (Findings of ACL, Vol. EMNLP 2020), Trev...

work page doi:10.18653/v1/2020.findings-emnlp.139 2020
[18]

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Ed...

work page doi:10.18653/v1/2021.emnlp- 2021
[19]

Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. InProceedings of the 40th International Conference on Software Engineering, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018, Michel Chaudron, Ivica Crnkovic, Marsha Chechik, and Mark Harman (Eds.). ACM, 933–944. doi:10.1145/3180155.3180167

work page doi:10.1145/3180155.3180167 2018
[20]

Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, Smaranda Muresan, Preslav Nakov, and Aline Villavicenci...

work page doi:10.18653/v1/2022.acl-long.499 2022
[21]

Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. GraphCodeBERT: Pre-training Code Representations with Data Flow. In9th International Conference on Learning Repre...

Pith/arXiv arXiv 2021
[22]

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021. Measuring coding challenge competence with apps.arXiv preprint arXiv:2105.09938 (2021). doi:10.48550/arXiv.2105.09938

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2105.09938 2021
[23]

Walker, and Jörg Denzinger

Reid Holmes, Rylan Cottrell, Robert J. Walker, and Jörg Denzinger. 2009. The end-to-end use of source code examples: An exploratory study. In25th IEEE International Conference on Software Maintenance (ICSM 2009), September 20-26, Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE005. Publication date: July 2026. FSE005:22 Ye Fan, Jidong Ge, Chuanyi Li, L...

work page doi:10.1109/icsm.2009.5306387 2009
[24]

Junjie Huang, Duyu Tang, Linjun Shou, Ming Gong, Ke Xu, Daxin Jiang, Ming Zhou, and Nan Duan. 2021. CoSQA: 20, 000+ Web Queries for Code Search and Question Answering. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volum...

work page doi:10.18653/v1/2021.acl-long.442 2021
[25]

Xiangbing Huang, Yingwei Ma, Haifang Zhou, Zhijie Jiang, Yuanliang Zhang, Teng Wang, and Shanshan Li. 2023. Towards better multilingual code search through cross-lingual contrastive learning. InProceedings of the 14th Asia- Pacific Symposium on Internetware. 22–32. doi:10.1145/3609437.3609439

work page doi:10.1145/3609437.3609439 2023
[26]

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search.CoRRabs/1909.09436 (2019). arXiv:1909.09436 doi:10.48550/ arXiv.1909.09436

Pith/arXiv arXiv 2019
[27]

Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Unsupervised dense information retrieval with contrastive learning.arXiv preprint arXiv:2112.09118 (2021). doi:10.48550/arXiv.2112.09118

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2112.09118 2021
[28]

Paras Jain, Ajay Jain, Tianjun Zhang, Pieter Abbeel, Joseph Gonzalez, and Ion Stoica. 2021. Contrastive Code Representation Learning. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, ...

2021
[29]

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, Bonnie Webber, Trevor Cohn, Yulan He, a...

work page doi:10.18653/v1/2020 2020
[30]

Iman Keivanloo, Juergen Rilling, and Ying Zou. 2014. Spotting working code examples. In36th International Conference on Software Engineering, ICSE ’14, Hyderabad, India - May 31 - June 07, 2014, Pankaj Jalote, Lionel C. Briand, and André van der Hoek (Eds.). ACM, 664–675. doi:10.1145/2568225.2568292

work page doi:10.1145/2568225.2568292 2014
[31]

Rui Li, Liyang He, Qi Liu, Yuze Zhao, Zheng Zhang, Zhenya Huang, Yu Su, and Shijin Wang. 2024. Consider: Commonalities and specialties driven multilingual code retrieval framework. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 8679–8687. doi:10.1609/aaai.v38i8.28713

work page doi:10.1609/aaai.v38i8.28713 2024
[32]

Xiaonan Li, Yeyun Gong, Yelong Shen, Xipeng Qiu, Hang Zhang, Bolun Yao, Weizhen Qi, Daxin Jiang, Weizhu Chen, and Nan Duan. 2022. CodeRetriever: Unimodal and Bimodal Contrastive Learning.CoRRabs/2201.10866 (2022). arXiv:2201.10866 doi:10.48550/arXiv.2201.10866

work page doi:10.48550/arxiv.2201.10866 2022
[33]

Xuan Li, Zerui Wang, Qianxiang Wang, Shoumeng Yan, Tao Xie, and Hong Mei. 2016. Relationship-aware code search for JavaScript frameworks. InProceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, Seattle, W A, USA, November 13-18, 2016, Thomas Zimmermann, Jane Cleland-Huang, and Zhendong Su (Eds.). ACM...

work page doi:10.1145/2950290.2950341 2016
[34]

Erik Linstead, Sushil Krishna Bajracharya, Trung Chi Ngo, Paul Rigor, Cristina Videira Lopes, and Pierre Baldi. 2009. Sourcerer: mining and searching internet-scale software repositories.Data Min. Knowl. Discov.18, 2 (2009), 300–336. doi:10.1007/S10618-008-0118-X

work page doi:10.1007/s10618-008-0118-x 2009
[35]

Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Unders...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2102.04664 2021
[36]

Fei Lv, Hongyu Zhang, Jian-Guang Lou, Shaowei Wang, Dongmei Zhang, and Jianjun Zhao. 2015. CodeHow: Effective Code Search Based on API Understanding and Extended Boolean Model (E). In30th IEEE/ACM International Conference on Automated Software Engineering, ASE 2015, Lincoln, NE, USA, November 9-13, 2015, Myra B. Cohen, Lars Grunske, and Michael Whalen (Ed...

work page doi:10.1109/ase.2015.42 2015
[37]

Xiaofei Ma, Cícero Nogueira dos Santos, and Andrew O. Arnold. 2021. Contrastive Fine-tuning Improves Robustness for Neural Rankers. InFindings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021 (Findings of ACL, Vol. ACL/IJCNLP 2021), Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). Association...

work page doi:10.18653/v1/2021.findings-acl.51 2021
[38]

Yingwei Ma, Yue Yu, Shanshan Li, Zhouyang Jia, Jun Ma, Rulin Xu, Wei Dong, and Xiangke Liao. 2023. Mulcs: Towards a unified deep representation for multilingual code search. In2023 IEEE International Conference on Software Analysis, Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE005. Publication date: July 2026. UNICS: Multilingual Code Search via Uni...

work page doi:10.1109/saner56733.2023.00021 2023
[39]

Collin McMillan, Mark Grechanik, Denys Poshyvanyk, Chen Fu, and Qing Xie. 2012. Exemplar: A Source Code Search Engine for Finding Highly Relevant Applications.IEEE Trans. Software Eng.38, 5 (2012), 1069–1087. doi:10.1109/TSE. 2011.84

work page doi:10.1109/tse 2012
[40]

Collin McMillan, Mark Grechanik, Denys Poshyvanyk, Qing Xie, and Chen Fu. 2011. Portfolio: finding relevant functions and their usage. InProceedings of the 33rd International Conference on Software Engineering, ICSE 2011, Waikiki, Honolulu , HI, USA, May 21-28, 2011, Richard N. Taylor, Harald C. Gall, and Nenad Medvidovic (Eds.). ACM, 111–120. doi:10.1145...

work page doi:10.1145/1985793.1985809 2011
[41]

Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, et al. 2022. Text and code embeddings by contrastive pre-training.arXiv preprint arXiv:2201.10005(2022). doi:10.48550/arXiv.2201.10005

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2201.10005 2022
[42]

Changan Niu, Chuanyi Li, Bin Luo, and Vincent Ng. 2022. Deep Learning Meets Software Engineering: A Survey on Pre- Trained Models of Source Code. InProceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, Luc De Raedt (Ed.). ijcai.org, 5546–5555. doi:10.24963/IJCAI.2022/775

work page doi:10.24963/ijcai.2022/775 2022
[43]

Luca Ponzanelli, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, and Michele Lanza. 2014. Mining StackOverflow to turn the IDE into a self-confident programming prompter. In11th Working Conference on Mining Software Repositories, MSR 2014, Proceedings, May 31 - June 1, 2014, Hyderabad, India, Premkumar T. Devanbu, Sung Kim, and Martin Pinzger (Eds....

work page doi:10.1145/2597073.2597077 2014
[44]

Saksham Sachdev, Hongyu Li, Sifei Luan, Seohyun Kim, Koushik Sen, and Satish Chandra. 2018. Retrieval on source code: a neural code search. InProceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL@PLDI 2018, Philadelphia, PA, USA, June 18-22, 2018, Justin Gottschlich and Alvin Cheung (Eds.). ACM, 31–4...

work page doi:10.1145/3211346.3211353 2018
[45]

Ensheng Shi, Wenchao Gu, Yanlin Wang, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, and Hongbin Sun
[46]

arXiv:2204.03293 doi:10.48550/ARXIV.2204.03293

Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation.CoRR abs/2204.03293 (2022). arXiv:2204.03293 doi:10.48550/ARXIV.2204.03293

work page doi:10.48550/arxiv.2204.03293 2022
[47]

Ensheng Shi, Yanlin Wang, Wenchao Gu, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, and Hongbin Sun. 2023. CoCoSoDa: Effective Contrastive Learning for Code Search. InProceedings of the 45th International Conference on Software Engineering(Melbourne, Victoria, Australia)(ICSE ’23). IEEE Press, 2198–2210. doi:10.1109/ICSE48619.2023. 00185

work page doi:10.1109/icse48619.2023 2023
[48]

Jianhang Shuai, Ling Xu, Chao Liu, Meng Yan, Xin Xia, and Yan Lei. 2020. Improving Code Search with Co-Attentive Representation Learning. InICPC ’20: 28th International Conference on Program Comprehension, Seoul, Republic of Korea, July 13-15, 2020. ACM, 196–207. doi:10.1145/3387904.3389269

work page doi:10.1145/3387904.3389269 2020
[49]

Islam, Iman Keivanloo, Chanchal Kumar Roy, and Mohammad Mamun Mia

Jeffrey Svajlenko, Judith F. Islam, Iman Keivanloo, Chanchal Kumar Roy, and Mohammad Mamun Mia. 2014. Towards a Big Data Curated Benchmark of Inter-project Code Clones. In30th IEEE International Conference on Software Maintenance and Evolution, Victoria, BC, Canada, September 29 - October 3, 2014. IEEE Computer Society, 476–480. doi:10.1109/ICSME.2014.77

work page doi:10.1109/icsme.2014.77 2014
[50]

Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. IntelliCode compose: code generation using transformer. InESEC/FSE ’20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, Prem Devanbu, Myra B. Cohen, and Thomas Zimmermann (Eds...

work page doi:10.1145/3368089.3417058 2020
[51]

Yao Wan, Jingdong Shu, Yulei Sui, Guandong Xu, Zhou Zhao, Jian Wu, and Philip S. Yu. 2019. Multi-modal Attention Network Learning for Semantic Source Code Retrieval. In34th IEEE/ACM International Conference on Automated Software Engineering, ASE 2019, San Diego, CA, USA, November 11-15, 2019. IEEE, 13–25. doi:10.1109/ASE.2019.00012

work page doi:10.1109/ase.2019.00012 2019
[52]

Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. 2020. Detecting Code Clones with Graph Neural Network and Flow- Augmented Abstract Syntax Tree. In27th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2020, London, ON, Canada, February 18-21, 2020, Kostas Kontogiannis, Foutse Khomh, Alexander Chatzigeorgiou, Marios-El...

work page doi:10.1109/saner48275.2020.9054857 2020
[53]

Xin Wang, Yasheng Wang, Fei Mi, Pingyi Zhou, Yao Wan, Xiao Liu, Li Li, Hao Wu, Jin Liu, and Xin Jiang. 2021. Syncobert: Syntax-guided multi-modal contrastive pre-training for code representation.arXiv preprint arXiv:2108.04556 (2021). doi:10.48550/arXiv.2108.04556

work page doi:10.48550/arxiv.2108.04556 2021
[54]

Yining Wang, Liwei Wang, Yuanzhi Li, Di He, and Tie-Yan Liu. 2013. A Theoretical Analysis of NDCG Type Ranking Measures. InCOLT 2013 - The 26th Annual Conference on Learning Theory, June 12-14, 2013, Princeton University, NJ, USA (JMLR Workshop and Conference Proceedings, Vol. 30), Shai Shalev-Shwartz and Ingo Steinwart (Eds.). JMLR.org, 25–54. doi:10.485...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1304.6480 2013
[55]

Joty, and Steven C

Yue Wang, Weishi Wang, Shafiq R. Joty, and Steven C. H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. InProceedings of the 2021 Conference on Em- pirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 Proc. ACM Softw. Eng., Vol...

work page doi:10.18653/v1/2021.emnlp-main.685 2021
[56]

Jun Xia, Lirong Wu, Ge Wang, Jintao Chen, and Stan Z Li. 2021. Progcl: Rethinking hard negative mining in graph contrastive learning.arXiv preprint arXiv:2110.02027(2021). doi:10.48550/arXiv.2110.02027

work page doi:10.48550/arxiv.2110.02027 2021
[57]

Huang Xie, Okko Räsänen, and Tuomas Virtanen. 2023. On negative sampling for contrastive audio-text retrieval. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5. doi:10.1109/ICASSP49357.2023.10095319

work page doi:10.1109/icassp49357.2023.10095319 2023
[58]

Shuhan Yan, Hang Yu, Yuting Chen, Beijun Shen, and Lingxiao Jiang. 2020. Are the Code Snippets What We Are Searching for? A Benchmark and an Empirical Study on Code Search with Natural-Language Queries. In27th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2020, London, ON, Canada, February 18-21, 2020, Kostas Konto...

work page doi:10.1109/saner48275.2020.9054840 2020
[59]

Weixiang Yan, Yuchen Tian, Yunzhe Li, Qian Chen, and Wen Wang. 2023. Codetransocean: A comprehensive multilingual benchmark for code translation.arXiv preprint arXiv:2310.04951(2023). doi:10.48550/arXiv.2310.04951

work page doi:10.48550/arxiv.2310.04951 2023
[60]

Hongyu Zhang, Anuj Jain, Gaurav Khandelwal, Chandrashekhar Kaushik, Scott Ge, and Wenxiang Hu. 2016. Bing developer assistant: improving developer productivity by recommending sample code. InProceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, Seattle, W A, USA, November 13-18, 2016, Thomas Zimmerma...

work page doi:10.1145/2950290.2983955 2016
[61]

Zhilu Zhang and Mert R. Sabuncu. 2018. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. InAdvances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Graum...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1805.07836 2018
[62]

Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, and Xiang Yue. 2024. Opencodeinterpreter: Integrating code generation with execution and refinement.arXiv preprint arXiv:2402.14658 (2024). doi:10.48550/arXiv.2402.14658

work page doi:10.48550/arxiv.2402.14658 2024
[63]

Jing Zhou and Robert J. Walker. 2016. API deprecation: a retrospective analysis and detection method for code examples on the web. InProceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, Seattle, W A, USA, November 13-18, 2016, Thomas Zimmermann, Jane Cleland-Huang, and Zhendong Su (Eds.). ACM, 266–2...

work page doi:10.1145/2950290.2950298 2016
[64]

Ming Zhu, Aneesh Jain, Karthik Suresh, Roshan Ravindran, Sindhu Tipirneni, and Chandan K. Reddy. 2022. XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence. arXiv:2206.08474 doi:10.48550/arXiv.2206.08474 Received 2025-09-12; accepted 2025-12-22 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE005. Publication date: July 2026

work page doi:10.48550/arxiv.2206.08474 2022

[1] [1]

Github Website

2023. Github Website. https://www.github.com

2023

[2] [2]

Code Embedding

2026. Code Embedding. https://bitbucket.org/anonymous_code/code_embedding

2026

[3] [3]

Shushan Arakelyan, Anna Hakhverdyan, Miltiadis Allamanis, Luis Garcia, Christophe Hauser, and Xiang Ren. 2022. NS3: Neuro-symbolic Semantic Code Search. (2022). doi:10.52202/068431-0761

work page doi:10.52202/068431-0761 2022

[4] [4]

Joel Brandt, Mira Dontcheva, Marcos Weskamp, and Scott R. Klemmer. 2010. Example-centric programming: integrating web search into the development environment. InProceedings of the 28th International Conference on Human Factors in Computing Systems, CHI 2010, Atlanta, Georgia, USA, April 10-15, 2010, Elizabeth D. Mynatt, Don Schoner, Geraldine Fitzpatrick,...

work page doi:10.1145/1753326.1753402 2010

[5] [5]

Guo, Joel Lewenstein, Mira Dontcheva, and Scott R

Joel Brandt, Philip J. Guo, Joel Lewenstein, Mira Dontcheva, and Scott R. Klemmer. 2009. Two studies of opportunistic programming: interleaving web foraging, learning, and writing code. InProceedings of the 27th International Conference on Human Factors in Computing Systems, CHI 2009, Boston, MA, USA, April 4-9, 2009, Dan R. Olsen Jr., Richard B. Arthur, ...

arXiv 2009

[6] [6]

Nghi D. Q. Bui, Yijun Yu, and Lingxiao Jiang. 2021. Self-Supervised Contrastive Learning for Code Retrieval and Summarization via Semantic-Preserving Transformations. InSIGIR ’21: The 44th International ACM SIGIR Conference Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE005. Publication date: July 2026. UNICS: Multilingual Code Search via Unified Pseu...

work page doi:10.1145/3404835.3462840 2021

[7] [7]

Róbert Busa-Fekete, György Szarvas, Tamás Elteto, and Balázs Kégl. 2012. An apple-to-apple comparison of learning- to-rank algorithms in terms of normalized discounted cumulative gain. InECAI 2012-20th European Conference on Artificial Intelligence: Preference Learning: Problems and Applications in AI Workshop, Vol. 242. Ios Press

2012

[8] [8]

José Cambronero, Hongyu Li, Seohyun Kim, Koushik Sen, and Satish Chandra. 2019. When deep learning met code search. InProceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2019, Tallinn, Estonia, August 26-30, 2019, Marlon Dumas, Dietmar Pfahl, Sven Apel,...

work page doi:10.1145/3338906.3340458 2019

[9] [9]

Brock Angus Campbell and Christoph Treude. 2017. NLP2Code: Code Snippet Content Assist via Natural Language Tasks. In2017 IEEE International Conference on Software Maintenance and Evolution, ICSME 2017, Shanghai, China, September 17-22, 2017. IEEE Computer Society, 628–632. doi:10.1109/ICSME.2017.56

work page doi:10.1109/icsme.2017.56 2017

[10] [10]

Yitian Chai, Hongyu Zhang, Beijun Shen, and Xiaodong Gu. 2022. Cross-Domain Deep Code Search with Meta Learning. In44th IEEE/ACM 44th International Conference on Software Engineering, ICSE 2022, Pittsburgh, PA, USA, May 25-27, 2022. ACM, 487–498. doi:10.1145/3510003.3510125

work page doi:10.1145/3510003.3510125 2022

[11] [11]

Wing-Kwan Chan, Hong Cheng, and David Lo. 2012. Searching connected API subgraph via text phrases. In20th ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE-20), SIGSOFT/FSE’12, Cary, NC, USA - November 11 - 16, 2012, Will Tracz, Martin P. Robillard, and Tevfik Bultan (Eds.). ACM, 10. doi:10.1145/2393596.2393606

work page doi:10.1145/2393596.2393606 2012

[12] [12]

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. Bge m3-embedding: Multi- lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.arXiv preprint arXiv:2402.03216(2024). doi:10.48550/arXiv.2402.03216

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.03216 2024

[13] [13]

Guanyi Chu, Xiao Wang, Chuan Shi, and Xunqiang Jiang. 2021. CuCo: Graph Representation with Curriculum Contrastive Learning.. InIJCAI. 2300–2306. doi:10.24963/ijcai.2021/317

work page doi:10.24963/ijcai.2021/317 2021

[14] [14]

Samip Dahal, Adyasha Maharana, and Mohit Bansal. 2022. Scotch: A Semantic Code Search Engine for IDEs. InDeep Learning for Code Workshop. https://openreview.net/forum?id=rSxfCiOZk-c

2022

[15] [15]

Jingtao Ding, Yuhan Quan, Quanming Yao, Yong Li, and Depeng Jin. 2020. Simplify and robustify negative sampling for implicit collaborative filtering.Advances in Neural Information Processing Systems33 (2020), 1094–1105. doi:10. 48550/arXiv.2009.03376

arXiv 2020

[16] [16]

Hongchao Fang and Pengtao Xie. 2020. CERT: Contrastive Self-supervised Learning for Language Understanding. CoRRabs/2005.12766 (2020). arXiv:2005.12766 doi:10.48550/arXiv.2005.12766

work page doi:10.48550/arxiv.2005.12766 2020

[17] [17]

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020 (Findings of ACL, Vol. EMNLP 2020), Trev...

work page doi:10.18653/v1/2020.findings-emnlp.139 2020

[18] [18]

Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple Contrastive Learning of Sentence Embeddings. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (Ed...

work page doi:10.18653/v1/2021.emnlp- 2021

[19] [19]

Xiaodong Gu, Hongyu Zhang, and Sunghun Kim. 2018. Deep code search. InProceedings of the 40th International Conference on Software Engineering, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018, Michel Chaudron, Ivica Crnkovic, Marsha Chechik, and Mark Harman (Eds.). ACM, 933–944. doi:10.1145/3180155.3180167

work page doi:10.1145/3180155.3180167 2018

[20] [20]

Daya Guo, Shuai Lu, Nan Duan, Yanlin Wang, Ming Zhou, and Jian Yin. 2022. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, Smaranda Muresan, Preslav Nakov, and Aline Villavicenci...

work page doi:10.18653/v1/2022.acl-long.499 2022

[21] [21]

Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou

Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin B. Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. GraphCodeBERT: Pre-training Code Representations with Data Flow. In9th International Conference on Learning Repre...

Pith/arXiv arXiv 2021

[22] [22]

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, et al. 2021. Measuring coding challenge competence with apps.arXiv preprint arXiv:2105.09938 (2021). doi:10.48550/arXiv.2105.09938

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2105.09938 2021

[23] [23]

Walker, and Jörg Denzinger

Reid Holmes, Rylan Cottrell, Robert J. Walker, and Jörg Denzinger. 2009. The end-to-end use of source code examples: An exploratory study. In25th IEEE International Conference on Software Maintenance (ICSM 2009), September 20-26, Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE005. Publication date: July 2026. FSE005:22 Ye Fan, Jidong Ge, Chuanyi Li, L...

work page doi:10.1109/icsm.2009.5306387 2009

[24] [24]

Junjie Huang, Duyu Tang, Linjun Shou, Ming Gong, Ke Xu, Daxin Jiang, Ming Zhou, and Nan Duan. 2021. CoSQA: 20, 000+ Web Queries for Code Search and Question Answering. InProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volum...

work page doi:10.18653/v1/2021.acl-long.442 2021

[25] [25]

Xiangbing Huang, Yingwei Ma, Haifang Zhou, Zhijie Jiang, Yuanliang Zhang, Teng Wang, and Shanshan Li. 2023. Towards better multilingual code search through cross-lingual contrastive learning. InProceedings of the 14th Asia- Pacific Symposium on Internetware. 22–32. doi:10.1145/3609437.3609439

work page doi:10.1145/3609437.3609439 2023

[26] [26]

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. 2019. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search.CoRRabs/1909.09436 (2019). arXiv:1909.09436 doi:10.48550/ arXiv.1909.09436

Pith/arXiv arXiv 2019

[27] [27]

Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2021. Unsupervised dense information retrieval with contrastive learning.arXiv preprint arXiv:2112.09118 (2021). doi:10.48550/arXiv.2112.09118

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2112.09118 2021

[28] [28]

Paras Jain, Ajay Jain, Tianjun Zhang, Pieter Abbeel, Joseph Gonzalez, and Ion Stoica. 2021. Contrastive Code Representation Learning. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, Marie-Francine Moens, Xuanjing Huang, Lucia Specia, ...

2021

[29] [29]

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick S. H. Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, Bonnie Webber, Trevor Cohn, Yulan He, a...

work page doi:10.18653/v1/2020 2020

[30] [30]

Iman Keivanloo, Juergen Rilling, and Ying Zou. 2014. Spotting working code examples. In36th International Conference on Software Engineering, ICSE ’14, Hyderabad, India - May 31 - June 07, 2014, Pankaj Jalote, Lionel C. Briand, and André van der Hoek (Eds.). ACM, 664–675. doi:10.1145/2568225.2568292

work page doi:10.1145/2568225.2568292 2014

[31] [31]

Rui Li, Liyang He, Qi Liu, Yuze Zhao, Zheng Zhang, Zhenya Huang, Yu Su, and Shijin Wang. 2024. Consider: Commonalities and specialties driven multilingual code retrieval framework. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 8679–8687. doi:10.1609/aaai.v38i8.28713

work page doi:10.1609/aaai.v38i8.28713 2024

[32] [32]

Xiaonan Li, Yeyun Gong, Yelong Shen, Xipeng Qiu, Hang Zhang, Bolun Yao, Weizhen Qi, Daxin Jiang, Weizhu Chen, and Nan Duan. 2022. CodeRetriever: Unimodal and Bimodal Contrastive Learning.CoRRabs/2201.10866 (2022). arXiv:2201.10866 doi:10.48550/arXiv.2201.10866

work page doi:10.48550/arxiv.2201.10866 2022

[33] [33]

Xuan Li, Zerui Wang, Qianxiang Wang, Shoumeng Yan, Tao Xie, and Hong Mei. 2016. Relationship-aware code search for JavaScript frameworks. InProceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, Seattle, W A, USA, November 13-18, 2016, Thomas Zimmermann, Jane Cleland-Huang, and Zhendong Su (Eds.). ACM...

work page doi:10.1145/2950290.2950341 2016

[34] [34]

Erik Linstead, Sushil Krishna Bajracharya, Trung Chi Ngo, Paul Rigor, Cristina Videira Lopes, and Pierre Baldi. 2009. Sourcerer: mining and searching internet-scale software repositories.Data Min. Knowl. Discov.18, 2 (2009), 300–336. doi:10.1007/S10618-008-0118-X

work page doi:10.1007/s10618-008-0118-x 2009

[35] [35]

Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. 2021. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Unders...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2102.04664 2021

[36] [36]

Fei Lv, Hongyu Zhang, Jian-Guang Lou, Shaowei Wang, Dongmei Zhang, and Jianjun Zhao. 2015. CodeHow: Effective Code Search Based on API Understanding and Extended Boolean Model (E). In30th IEEE/ACM International Conference on Automated Software Engineering, ASE 2015, Lincoln, NE, USA, November 9-13, 2015, Myra B. Cohen, Lars Grunske, and Michael Whalen (Ed...

work page doi:10.1109/ase.2015.42 2015

[37] [37]

Xiaofei Ma, Cícero Nogueira dos Santos, and Andrew O. Arnold. 2021. Contrastive Fine-tuning Improves Robustness for Neural Rankers. InFindings of the Association for Computational Linguistics: ACL/IJCNLP 2021, Online Event, August 1-6, 2021 (Findings of ACL, Vol. ACL/IJCNLP 2021), Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli (Eds.). Association...

work page doi:10.18653/v1/2021.findings-acl.51 2021

[38] [38]

Yingwei Ma, Yue Yu, Shanshan Li, Zhouyang Jia, Jun Ma, Rulin Xu, Wei Dong, and Xiangke Liao. 2023. Mulcs: Towards a unified deep representation for multilingual code search. In2023 IEEE International Conference on Software Analysis, Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE005. Publication date: July 2026. UNICS: Multilingual Code Search via Uni...

work page doi:10.1109/saner56733.2023.00021 2023

[39] [39]

Collin McMillan, Mark Grechanik, Denys Poshyvanyk, Chen Fu, and Qing Xie. 2012. Exemplar: A Source Code Search Engine for Finding Highly Relevant Applications.IEEE Trans. Software Eng.38, 5 (2012), 1069–1087. doi:10.1109/TSE. 2011.84

work page doi:10.1109/tse 2012

[40] [40]

Collin McMillan, Mark Grechanik, Denys Poshyvanyk, Qing Xie, and Chen Fu. 2011. Portfolio: finding relevant functions and their usage. InProceedings of the 33rd International Conference on Software Engineering, ICSE 2011, Waikiki, Honolulu , HI, USA, May 21-28, 2011, Richard N. Taylor, Harald C. Gall, and Nenad Medvidovic (Eds.). ACM, 111–120. doi:10.1145...

work page doi:10.1145/1985793.1985809 2011

[41] [41]

Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, et al. 2022. Text and code embeddings by contrastive pre-training.arXiv preprint arXiv:2201.10005(2022). doi:10.48550/arXiv.2201.10005

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2201.10005 2022

[42] [42]

Changan Niu, Chuanyi Li, Bin Luo, and Vincent Ng. 2022. Deep Learning Meets Software Engineering: A Survey on Pre- Trained Models of Source Code. InProceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, Luc De Raedt (Ed.). ijcai.org, 5546–5555. doi:10.24963/IJCAI.2022/775

work page doi:10.24963/ijcai.2022/775 2022

[43] [43]

Luca Ponzanelli, Gabriele Bavota, Massimiliano Di Penta, Rocco Oliveto, and Michele Lanza. 2014. Mining StackOverflow to turn the IDE into a self-confident programming prompter. In11th Working Conference on Mining Software Repositories, MSR 2014, Proceedings, May 31 - June 1, 2014, Hyderabad, India, Premkumar T. Devanbu, Sung Kim, and Martin Pinzger (Eds....

work page doi:10.1145/2597073.2597077 2014

[44] [44]

Saksham Sachdev, Hongyu Li, Sifei Luan, Seohyun Kim, Koushik Sen, and Satish Chandra. 2018. Retrieval on source code: a neural code search. InProceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL@PLDI 2018, Philadelphia, PA, USA, June 18-22, 2018, Justin Gottschlich and Alvin Cheung (Eds.). ACM, 31–4...

work page doi:10.1145/3211346.3211353 2018

[45] [45]

Ensheng Shi, Wenchao Gu, Yanlin Wang, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, and Hongbin Sun

[46] [46]

arXiv:2204.03293 doi:10.48550/ARXIV.2204.03293

Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation.CoRR abs/2204.03293 (2022). arXiv:2204.03293 doi:10.48550/ARXIV.2204.03293

work page doi:10.48550/arxiv.2204.03293 2022

[47] [47]

Ensheng Shi, Yanlin Wang, Wenchao Gu, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, and Hongbin Sun. 2023. CoCoSoDa: Effective Contrastive Learning for Code Search. InProceedings of the 45th International Conference on Software Engineering(Melbourne, Victoria, Australia)(ICSE ’23). IEEE Press, 2198–2210. doi:10.1109/ICSE48619.2023. 00185

work page doi:10.1109/icse48619.2023 2023

[48] [48]

Jianhang Shuai, Ling Xu, Chao Liu, Meng Yan, Xin Xia, and Yan Lei. 2020. Improving Code Search with Co-Attentive Representation Learning. InICPC ’20: 28th International Conference on Program Comprehension, Seoul, Republic of Korea, July 13-15, 2020. ACM, 196–207. doi:10.1145/3387904.3389269

work page doi:10.1145/3387904.3389269 2020

[49] [49]

Islam, Iman Keivanloo, Chanchal Kumar Roy, and Mohammad Mamun Mia

Jeffrey Svajlenko, Judith F. Islam, Iman Keivanloo, Chanchal Kumar Roy, and Mohammad Mamun Mia. 2014. Towards a Big Data Curated Benchmark of Inter-project Code Clones. In30th IEEE International Conference on Software Maintenance and Evolution, Victoria, BC, Canada, September 29 - October 3, 2014. IEEE Computer Society, 476–480. doi:10.1109/ICSME.2014.77

work page doi:10.1109/icsme.2014.77 2014

[50] [50]

Alexey Svyatkovskiy, Shao Kun Deng, Shengyu Fu, and Neel Sundaresan. 2020. IntelliCode compose: code generation using transformer. InESEC/FSE ’20: 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Virtual Event, USA, November 8-13, 2020, Prem Devanbu, Myra B. Cohen, and Thomas Zimmermann (Eds...

work page doi:10.1145/3368089.3417058 2020

[51] [51]

Yao Wan, Jingdong Shu, Yulei Sui, Guandong Xu, Zhou Zhao, Jian Wu, and Philip S. Yu. 2019. Multi-modal Attention Network Learning for Semantic Source Code Retrieval. In34th IEEE/ACM International Conference on Automated Software Engineering, ASE 2019, San Diego, CA, USA, November 11-15, 2019. IEEE, 13–25. doi:10.1109/ASE.2019.00012

work page doi:10.1109/ase.2019.00012 2019

[52] [52]

Wenhan Wang, Ge Li, Bo Ma, Xin Xia, and Zhi Jin. 2020. Detecting Code Clones with Graph Neural Network and Flow- Augmented Abstract Syntax Tree. In27th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2020, London, ON, Canada, February 18-21, 2020, Kostas Kontogiannis, Foutse Khomh, Alexander Chatzigeorgiou, Marios-El...

work page doi:10.1109/saner48275.2020.9054857 2020

[53] [53]

Xin Wang, Yasheng Wang, Fei Mi, Pingyi Zhou, Yao Wan, Xiao Liu, Li Li, Hao Wu, Jin Liu, and Xin Jiang. 2021. Syncobert: Syntax-guided multi-modal contrastive pre-training for code representation.arXiv preprint arXiv:2108.04556 (2021). doi:10.48550/arXiv.2108.04556

work page doi:10.48550/arxiv.2108.04556 2021

[54] [54]

Yining Wang, Liwei Wang, Yuanzhi Li, Di He, and Tie-Yan Liu. 2013. A Theoretical Analysis of NDCG Type Ranking Measures. InCOLT 2013 - The 26th Annual Conference on Learning Theory, June 12-14, 2013, Princeton University, NJ, USA (JMLR Workshop and Conference Proceedings, Vol. 30), Shai Shalev-Shwartz and Ingo Steinwart (Eds.). JMLR.org, 25–54. doi:10.485...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1304.6480 2013

[55] [55]

Joty, and Steven C

Yue Wang, Weishi Wang, Shafiq R. Joty, and Steven C. H. Hoi. 2021. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. InProceedings of the 2021 Conference on Em- pirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 Proc. ACM Softw. Eng., Vol...

work page doi:10.18653/v1/2021.emnlp-main.685 2021

[56] [56]

Jun Xia, Lirong Wu, Ge Wang, Jintao Chen, and Stan Z Li. 2021. Progcl: Rethinking hard negative mining in graph contrastive learning.arXiv preprint arXiv:2110.02027(2021). doi:10.48550/arXiv.2110.02027

work page doi:10.48550/arxiv.2110.02027 2021

[57] [57]

Huang Xie, Okko Räsänen, and Tuomas Virtanen. 2023. On negative sampling for contrastive audio-text retrieval. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 1–5. doi:10.1109/ICASSP49357.2023.10095319

work page doi:10.1109/icassp49357.2023.10095319 2023

[58] [58]

Shuhan Yan, Hang Yu, Yuting Chen, Beijun Shen, and Lingxiao Jiang. 2020. Are the Code Snippets What We Are Searching for? A Benchmark and an Empirical Study on Code Search with Natural-Language Queries. In27th IEEE International Conference on Software Analysis, Evolution and Reengineering, SANER 2020, London, ON, Canada, February 18-21, 2020, Kostas Konto...

work page doi:10.1109/saner48275.2020.9054840 2020

[59] [59]

Weixiang Yan, Yuchen Tian, Yunzhe Li, Qian Chen, and Wen Wang. 2023. Codetransocean: A comprehensive multilingual benchmark for code translation.arXiv preprint arXiv:2310.04951(2023). doi:10.48550/arXiv.2310.04951

work page doi:10.48550/arxiv.2310.04951 2023

[60] [60]

Hongyu Zhang, Anuj Jain, Gaurav Khandelwal, Chandrashekhar Kaushik, Scott Ge, and Wenxiang Hu. 2016. Bing developer assistant: improving developer productivity by recommending sample code. InProceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, Seattle, W A, USA, November 13-18, 2016, Thomas Zimmerma...

work page doi:10.1145/2950290.2983955 2016

[61] [61]

Zhilu Zhang and Mert R. Sabuncu. 2018. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. InAdvances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Graum...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1805.07836 2018

[62] [62]

Tianyu Zheng, Ge Zhang, Tianhao Shen, Xueling Liu, Bill Yuchen Lin, Jie Fu, Wenhu Chen, and Xiang Yue. 2024. Opencodeinterpreter: Integrating code generation with execution and refinement.arXiv preprint arXiv:2402.14658 (2024). doi:10.48550/arXiv.2402.14658

work page doi:10.48550/arxiv.2402.14658 2024

[63] [63]

Jing Zhou and Robert J. Walker. 2016. API deprecation: a retrospective analysis and detection method for code examples on the web. InProceedings of the 24th ACM SIGSOFT International Symposium on Foundations of Software Engineering, FSE 2016, Seattle, W A, USA, November 13-18, 2016, Thomas Zimmermann, Jane Cleland-Huang, and Zhendong Su (Eds.). ACM, 266–2...

work page doi:10.1145/2950290.2950298 2016

[64] [64]

Ming Zhu, Aneesh Jain, Karthik Suresh, Roshan Ravindran, Sindhu Tipirneni, and Chandan K. Reddy. 2022. XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence. arXiv:2206.08474 doi:10.48550/arXiv.2206.08474 Received 2025-09-12; accepted 2025-12-22 Proc. ACM Softw. Eng., Vol. 3, No. FSE, Article FSE005. Publication date: July 2026

work page doi:10.48550/arxiv.2206.08474 2022