Beyond Retrieval: A Multitask Benchmark and Model for Code Search

2026 · cs.SE · arXiv 2605.04615

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce CoREB, a contamination-limited, multitask code retrieval and reranking benchmark, together with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. CoREB is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code. Our experiments reveal that: (1) code-specialised embeddings dominate code-to-code retrieval (~2× over general encoders), yet no single model wins all three tasks; (2) short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10; (3) off-the-shelf rerankers are task-asymmetric, with a 12-point swing on code-to-code and no baseline net-positive across all tasks; and (4) our fine-tuned CoREB-Reranker is the first to achieve consistent gains across all three tasks. The data and model are released.

representative citing papers

Make LLM Learn to Synthesize from Streaming Experiences through Feedback

cs.AI · 2026-05-28 · unverdicted · novelty 6.0

SynLearner lets LLMs improve synthetic data generation on later tasks in a stream by learning reusable patterns and balancing quality with diversity from feedback on earlier tasks.

ZooClaw-FashionSigLIP2: Distilled Fine-tuning for Robust Fashion Retrieval

cs.CV · 2026-06-26 · unverdicted · novelty 4.0

ZooClaw-FashionSigLIP2 applies distilled full fine-tuning plus WiseFT interpolation to SigLIP2-base and reports outperforming LoRA, larger backbones, and external data on fashion retrieval benchmarks while releasing a new benchmark and bias analysis.
