Beyond Retrieval: A Multitask Benchmark and Model for Code Search

Fan Zhou; Hang Yu; Jin Qin; Siqiao Xue; Yixiang Mu; Zihan Liao; Ziyin Zhang

arxiv: 2605.04615 · v2 · submitted 2026-05-06 · 💻 cs.SE · cs.AI

Siqiao Xue, Zihan Liao, Jin Qin, Ziyin Zhang, Yixiang Mu, Fan Zhou, Hang Yu

Pith reviewed 2026-05-11 01:21 UTC · model grok-4.3

keywords: code search, reranking, benchmark, multitask, code retrieval, text-to-code, code-to-text, graded relevance

The pith

A fine-tuned reranker delivers consistent gains across text-to-code, code-to-text, and code-to-code tasks where prior models do not.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces CoREB, a multitask benchmark for the full code search pipeline that covers retrieval followed by reranking. It builds the data from counterfactually rewritten LiveCodeBench problems in five languages, released on a timed schedule with graded relevance judgments to limit contamination and noise. The authors test eleven embedding models and five rerankers on the three tasks and show that code-specialized embeddings help on code-to-code but no model wins everything, short keyword queries collapse performance for all, and off-the-shelf rerankers swing widely by task with no net benefit. Their fine-tuned CoREB-Reranker is the first to improve results across every task. This matters because production code search uses these combined stages, yet benchmarks have long focused only on initial retrieval.
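
The two-stage pipeline described above (first-stage retrieval, then reranking of the top candidates) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `retrieve_then_rerank`, `token_overlap`, and `jaccard` are hypothetical names, and the two toy scorers stand in for an embedding model and a reranker respectively.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    embed_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    k: int = 128,
    top_n: int = 10,
) -> List[Tuple[int, float]]:
    """Return (doc_index, rerank_score) pairs for the final top_n results."""
    # Stage 1: score every document with the cheap first-stage scorer, keep top-k.
    first_stage = sorted(
        range(len(corpus)),
        key=lambda i: embed_score(query, corpus[i]),
        reverse=True,
    )[:k]
    # Stage 2: rescore only those k candidates with the more expensive reranker.
    reranked = sorted(
        ((i, rerank_score(query, corpus[i])) for i in first_stage),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return reranked[:top_n]


def token_overlap(query: str, doc: str) -> float:
    """Toy first-stage scorer (assumption): count of shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def jaccard(query: str, doc: str) -> float:
    """Toy reranker (assumption): Jaccard similarity over lowercase tokens."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0
```

The default `k=128` mirrors the candidate depth reported for the reranking experiments; a document that misses the first-stage cut can never be recovered by the reranker, which is why the benchmark evaluates both stages.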

Core claim

The central claim is that a reranker fine-tuned on the CoREB multitask data becomes the first model to achieve consistent gains across text-to-code, code-to-text, and code-to-code tasks, while code-specialized embeddings dominate only code-to-code retrieval and off-the-shelf rerankers remain task-asymmetric with no baseline that helps on all three.

What carries the argument

The CoREB benchmark, built from counterfactually rewritten LiveCodeBench problems with timed releases and graded relevance judgments, together with the CoREB-Reranker fine-tuned on its multitask retrieval and reranking data.

If this is right

Code-specialized embedding models achieve roughly twice the performance of general encoders on code-to-code retrieval.
Short keyword queries, the format closest to actual developer use, drive every model to near-zero nDCG@10.
Off-the-shelf rerankers exhibit large task asymmetry, with up to 12-point swings and no net-positive result across all tasks.
Fine-tuning a reranker on the multitask benchmark overcomes the task-specific limitations seen in prior models.
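
The "net-positive across all tasks" criterion in the points above can be made concrete with a small check. This is a sketch under assumptions: the function names are hypothetical, the scores in the usage note are made-up illustrative values rather than results from the paper, and the paper's exact definition of a "swing" is not reproduced here.

```python
from typing import Dict


def reranker_deltas(before: Dict[str, float], after: Dict[str, float]) -> Dict[str, float]:
    """Per-task nDCG@10 change: score after reranking minus score before."""
    return {task: round(after[task] - before[task], 4) for task in before}


def net_positive_on_all_tasks(deltas: Dict[str, float]) -> bool:
    """True only if reranking improved every task, not just the average."""
    return all(delta > 0 for delta in deltas.values())
```

For example, with hypothetical first-stage scores `{"text-to-code": 0.50, "code-to-text": 0.60, "code-to-code": 0.40}`, a reranker that reaches 0.55, 0.58, and 0.46 improves the average but fails the check because code-to-text drops.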

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Timed benchmark releases could allow repeated testing while tracking contamination over time.
Counterfactual rewriting of problems might be used to create cleaner benchmarks for other retrieval or generation tasks.
Embedding multitask rerankers into search tools could raise the quality of code suggestions that developers actually encounter.

Load-bearing premise

The counterfactually rewritten problems produce graded relevance judgments that stay free of label noise and match original developer intent without new biases.

What would settle it

If CoREB-Reranker showed no improvement over strong baselines when tested on independently collected real developer code search queries with fresh relevance labels, the claim of consistent gains would be falsified.

Figures

Figures reproduced from arXiv: 2605.04615 by Fan Zhou, Hang Yu, Jin Qin, Siqiao Xue, Yixiang Mu, Zihan Liao, Ziyin Zhang.

**Figure 1.** Aggregated query distribution across both C…

**Figure 2.** Benchmark construction pipeline. Each step is detailed in the corresponding paragraph below.

**Figure 3.** Pass@1 change (pp) after rewriting for Gemini 3 Flash across two releases covering different contest windows.

Step 2: Counterfactual rewriting and code generation. We apply light counterfactual rewriting (Wu et al., 2024) to each problem's statement and test cases by modifying named entities, variable names, narrative framing, and I/O examples while preserving the formal specification and algorithmic str…

**Figure 4.** nDCG@k at k ∈ {1, 5, 10} per task, averaged over all eleven models on v202603. Code-to-text saturates early; text-to-code and code-to-code grow more steeply with k.

| Model | Canonical | Full | Search |
| --- | --- | --- | --- |
| GemEmb-2 | 0.573 | 0.565 | 0.000 |
| C2LLM-7B | 0.582 | 0.578 | 0.004 |
| C2LLM-0.5B | 0.566 | 0.560 | 0.003 |
| F2LLM-4B | 0.529 | 0.535 | 0.007 |
| Jina-code-1.5B | 0.539 | 0.543 | 0.006 |
| F2LLM-1.7B | 0.481 | 0.520 | 0.008 |
| Qwen3-4B | 0.510 | 0.504 | 0.015 |
| Qwen3-0.6B | 0.452 | … | … |

**Figure 6.** Text-to-code nDCG@10 by target language on v202603 (excluding Search subtask). Full per-subtask and per-language tables are in Appendix B.6.

Analysis V: Do hard negatives provide additional evaluation signal? A key design choice in CoREB is the explicit inclusion of same-problem hard negatives (relevance=1) in the qrels. Unlike benchmarks with binary or absent-is-irrelevant qrels, our graded scheme expose…

**Figure 7.** Hard-negative intrusion rate: fraction of queries where at least one hard negative ranks above the best…

**Figure 8.** ΔnDCG@10 (%) after reranking (k = 128) on top of C2LLM-7B. No baseline is net-positive across all three tasks; only our fine-tuned CoREB-Reranker achieves this.

Main results on reranking. We rerank the top-128 candidates retrieved by C2LLM-7B (the strongest open-weight retriever) with four baseline rerankers: Jina Reranker v2, Jina Reranker v3, Qwen3-Reranker-0.6B, and Qwen3-Reranker-4B

**Figure 9.** Overall nDCG@10 vs. parameter count (log scale) for the ten open-weight models. Circles = code-specialized; squares = general-purpose. The dashed line marks the Pareto frontier. GemEmb-2 is excluded due to its unknown parameter size.

[Adjacent plot: overall nDCG@10 per billion parameters, by model]

**Figure 11.** Query distribution per CoREB release: v202602 (2,604 queries, top row) and v202603 (2,483 queries, bottom row). The two releases exhibit nearly identical proportions; see…

**Figure 12.** Code-to-text nDCG@10 by subtask type on v202603.

How do models handle short keyword queries? The text-to-code Search subtask uses short developer-style queries (19 tokens on average). On this subtask every model collapses to near-zero nDCG@10, two orders of magnitude below the Canonical subtask (…

**Figure 13.** Full model parameter efficiency (expanded version of Figures…

**Figure 14.** Recall@…
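
The hard-negative intrusion rate from Figure 7 can be sketched as below. The caption is truncated in this page, so the sketch assumes it ends with "the best-ranked true positive"; it also assumes the graded labels quoted elsewhere in the paper (relevance=2 for true positives, relevance=1 for same-problem hard negatives). The function name and the handling of queries whose true positive is never retrieved are illustrative choices, not taken from the paper.

```python
from typing import Dict, List


def intrusion_rate(
    rankings: Dict[str, List[str]],
    qrels: Dict[str, Dict[str, int]],
) -> float:
    """Fraction of queries where a hard negative outranks the best true positive."""
    intruded = 0
    counted = 0
    for qid, ranked in rankings.items():
        rels = qrels.get(qid, {})
        positives = {doc for doc, rel in rels.items() if rel == 2}
        hard_negatives = {doc for doc, rel in rels.items() if rel == 1}
        if not positives:
            continue  # no true positive labeled for this query: skip it
        counted += 1
        positive_ranks = [i for i, doc in enumerate(ranked) if doc in positives]
        # Assumption: if no true positive was retrieved, treat its rank as
        # "after the end of the list", so any retrieved hard negative intrudes.
        best_positive = min(positive_ranks) if positive_ranks else len(ranked)
        if any(doc in hard_negatives for doc in ranked[:best_positive]):
            intruded += 1
    return intruded / counted if counted else 0.0
```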

Original abstract

Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce CoREB, a contamination-limited, multitask code retrieval and reranking benchmark, together with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. CoREB is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code. Our experiments reveal that: ① code-specialised embeddings dominate code-to-code retrieval (~2× over general encoders), yet no single model wins all three tasks; ② short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10; ③ off-the-shelf rerankers are task-asymmetric, with a 12-point swing on code-to-code and no baseline net-positive across all tasks; ④ our fine-tuned CoREB-Reranker is the first to achieve consistent gains across all three tasks. The data and model are released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CoREB, a contamination-limited multitask benchmark for code retrieval and reranking constructed from counterfactually rewritten LiveCodeBench problems across five languages with graded relevance judgments. It evaluates 11 embedding models and 5 rerankers on text-to-code, code-to-text, and code-to-code tasks, highlighting asymmetries in model performance (e.g., code-specialized embeddings excelling on code-to-code but no universal winner) and query format effects, while proposing a fine-tuned CoREB-Reranker that achieves the first consistent nDCG@10 gains across all three tasks. Data and model are released.

Significance. If the benchmark construction and graded labels prove robust, this advances code search evaluation by moving beyond first-stage retrieval to full pipelines, addressing data contamination, binary relevance, and label noise in prior benchmarks. The multitask design, empirical findings on short-keyword query collapse and reranker task-asymmetry, plus artifact release, provide a useful foundation for more realistic code search research.

major comments (3)

[§3] Benchmark Construction: The counterfactual rewriting of LiveCodeBench problems to generate graded relevance judgments lacks explicit validation (e.g., human review of intent preservation or semantic equivalence checks). This is load-bearing for all reported nDCG@10 results and the headline claim of consistent reranker gains, as any intent drift or label noise from edits to control flow, variables, or edge cases could artifactually inflate the observed 12-point swings.
[§5.3–5.4] Experiments and Results: The claim that CoREB-Reranker is the first to achieve net-positive gains across all three tasks is presented without statistical significance tests, confidence intervals, or multiple-run variance for the nDCG@10 differences. This weakens attribution of improvements to the model rather than benchmark-specific effects.
[Table 3] Table 3 (or equivalent performance tables): No ablation compares results on original LiveCodeBench problems versus the rewritten versions, making it impossible to isolate whether the multitask gains stem from the reranker or from properties introduced by the rewriting process itself.

minor comments (2)

[Abstract / §1] The abstract and §1 should clarify the exact timing and versioning mechanism for the 'timed releases' to support reproducibility claims.
[§3.2] Notation for graded relevance (e.g., how 0–3 or similar scales map to nDCG computation) could be stated more explicitly in §3.2 to avoid ambiguity in replication.
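
To illustrate what the second minor comment asks the authors to pin down, here is one common way graded labels feed into nDCG@k. It is a sketch, not the paper's definition: it assumes a linear gain (the relevance grade itself) and a log2 rank discount, whereas an exponential gain of 2^rel − 1 is an equally common convention, and the page does not say which the authors use. The grades 2 (true positive) and 1 (same-problem hard negative) follow the passage quoted from the paper.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[int], k: int) -> float:
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked: List[str], rels: Dict[str, int], k: int = 10) -> float:
    """nDCG@k for one query; documents absent from rels count as relevance 0."""
    gains = [rels.get(doc, 0) for doc in ranked]
    ideal_gains = sorted(rels.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0
```

Under this scheme a ranking that places a hard negative above the true positive scores strictly below 1.0, while a binary scheme that ignores hard negatives would not distinguish the hard negative from any other irrelevant document.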

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and will incorporate revisions to strengthen the manuscript's claims regarding benchmark validity, statistical rigor, and ablation analysis.

Referee: [§3] Benchmark Construction: The counterfactual rewriting of LiveCodeBench problems to generate graded relevance judgments lacks explicit validation (e.g., human review of intent preservation or semantic equivalence checks). This is load-bearing for all reported nDCG@10 results and the headline claim of consistent reranker gains, as any intent drift or label noise from edits to control flow, variables, or edge cases could artifactually inflate the observed 12-point swings.

Authors: We appreciate this observation on the robustness of our benchmark construction. The rewriting process in §3 followed a structured protocol to preserve core problem intent, functionality, and semantics while introducing controlled counterfactual variations (e.g., variable renaming and control-flow adjustments that do not alter expected outputs). However, we acknowledge that a formal human validation study was not reported in the initial submission. In the revised manuscript, we will expand §3 with the full rewriting guidelines and include results from a targeted human evaluation (on a sample of problems across languages) confirming intent preservation and low label noise. This directly addresses concerns about potential artifacts in the nDCG@10 gains. Revision: yes.

Referee: [§5.3–5.4] Experiments and Results: The claim that CoREB-Reranker is the first to achieve net-positive gains across all three tasks is presented without statistical significance tests, confidence intervals, or multiple-run variance for the nDCG@10 differences. This weakens attribution of improvements to the model rather than benchmark-specific effects.

Authors: We agree that the absence of statistical analysis limits the strength of our attribution claims. The original experiments used single-run evaluations on the fixed CoREB splits. In the revised version, we will add bootstrap confidence intervals (with 1000 resamples) for all nDCG@10 scores and apply paired non-parametric tests (Wilcoxon signed-rank) to evaluate whether CoREB-Reranker's improvements over baselines are statistically significant across the three tasks. This will better isolate model contributions from benchmark-specific variance. Revision: yes.

Referee: [Table 3] Table 3 (or equivalent performance tables): No ablation compares results on original LiveCodeBench problems versus the rewritten versions, making it impossible to isolate whether the multitask gains stem from the reranker or from properties introduced by the rewriting process itself.

Authors: This is a valid methodological concern. While CoREB is designed as a contamination-limited benchmark with timed releases, we did not include a direct comparison to the original LiveCodeBench problems in the submitted manuscript. We will add this ablation in the revised paper: we will evaluate the same embedding models and rerankers on the subset of original LiveCodeBench problems that overlap with our rewritten set (where feasible) and report the delta in nDCG@10. This will help clarify the contribution of the rewriting process versus the reranker itself. Revision: yes.
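
The bootstrap confidence intervals proposed in the second response could look roughly like the following. This is a sketch under assumptions, not the authors' analysis code: it resamples per-query nDCG@10 differences with replacement (a paired percentile bootstrap), the function name and the fixed seed are illustrative, and the Wilcoxon signed-rank test mentioned in the rebuttal is not covered here.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    baseline: List[float],
    reranked: List[float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean difference, lower bound, upper bound) for reranked - baseline."""
    if len(baseline) != len(reranked) or not baseline:
        raise ValueError("need two non-empty, equally long per-query score lists")
    rng = random.Random(seed)
    # Pairing: each difference comes from the same query under both systems.
    diffs = [r - b for b, r in zip(baseline, reranked)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    # Percentile interval: cut alpha/2 of the resampled means from each tail.
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper
```

If the resulting interval for a task excludes zero, the per-query gain on that task is unlikely to be a resampling artifact; an interval that straddles zero would weaken the "consistent gains" reading for that task.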

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark and model evaluation on released data

The paper constructs CoREB from counterfactually rewritten LiveCodeBench problems, releases the data and fine-tuned CoREB-Reranker, then reports empirical nDCG@10 results across three tasks for eleven embeddings and five rerankers. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim of consistent gains is an empirical observation on the new benchmark rather than a reduction to prior inputs by construction. The study is self-contained against external benchmarks and released artifacts.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are required beyond standard assumptions in machine learning benchmarking and fine-tuning.

pith-pipeline@v0.9.0 · 5586 in / 1072 out tokens · 34325 ms · 2026-05-11T01:21:39.301354+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Our experiments reveal that: ① code-specialised embeddings dominate code-to-code retrieval (~2× over general encoders), yet no single model wins all three tasks; ② short keyword queries... collapse every model to near-zero nDCG@10; ③ off-the-shelf rerankers are task-asymmetric...; ④ our fine-tuned CoREB-Reranker is the first to achieve consistent gains across all three tasks.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: CoREB is built from counterfactually rewritten LiveCodeBench problems... with graded relevance judgments... relevance=2 to true positives, relevance=1 to same-problem hard negatives

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
