
arxiv: 2605.04615 · v2 · submitted 2026-05-06 · 💻 cs.SE · cs.AI

Beyond Retrieval: A Multitask Benchmark and Model for Code Search

Pith reviewed 2026-05-11 01:21 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords code search · reranking · benchmark · multitask · code retrieval · text-to-code · code-to-text · graded relevance

The pith

A fine-tuned reranker delivers consistent gains across text-to-code, code-to-text, and code-to-code tasks where prior models do not.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces CoREB, a multitask benchmark for the full code search pipeline that covers retrieval followed by reranking. It builds the data from counterfactually rewritten LiveCodeBench problems in five languages, released on a timed schedule with graded relevance judgments to limit contamination and noise. The authors test eleven embedding models and five rerankers on the three tasks and show that code-specialized embeddings help on code-to-code but no model wins everything, short keyword queries collapse performance for all, and off-the-shelf rerankers swing widely by task with no net benefit. Their fine-tuned CoREB-Reranker is the first to improve results across every task. This matters because production code search uses these combined stages, yet benchmarks have long focused only on initial retrieval.

Core claim

The central claim is that a reranker fine-tuned on the CoREB multitask data becomes the first model to achieve consistent gains across text-to-code, code-to-text, and code-to-code tasks, while code-specialized embeddings dominate only code-to-code retrieval and off-the-shelf rerankers remain task-asymmetric with no baseline that helps on all three.

What carries the argument

The CoREB benchmark, built from counterfactually rewritten LiveCodeBench problems with timed releases and graded relevance judgments, together with the CoREB-Reranker fine-tuned on its multitask retrieval and reranking data.

If this is right

  • Code-specialized embedding models achieve roughly twice the performance of general encoders on code-to-code retrieval.
  • Short keyword queries, the format closest to actual developer use, drive every model to near-zero nDCG@10.
  • Off-the-shelf rerankers exhibit large task asymmetry, with up to 12-point swings and no net-positive result across all tasks.
  • Fine-tuning a reranker on the multitask benchmark overcomes the task-specific limitations seen in prior models.
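
The reranking points above concern a second stage that reorders only the candidates a first-stage retriever has already returned. The following is a minimal Python sketch of that step, under stated assumptions: `rerank_top_k`, `toy_overlap_score`, the toy documents, and the token-overlap scorer are all hypothetical stand-ins for a learned reranker and do not come from the paper.

```python
# Hypothetical toy corpus; a real pipeline would hold code snippets or docstrings.
DOCS = {
    "d1": "sort list ascending",
    "d2": "reverse a string",
    "d3": "binary search sorted list",
}


def toy_overlap_score(query, doc_id):
    """Assumed stand-in scorer: number of query tokens that appear in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(DOCS[doc_id].lower().split())
    return len(query_tokens & doc_tokens)


def rerank_top_k(query, ranked_doc_ids, score_fn, k=128):
    """Reorder the first k retrieved candidates by score_fn; keep the tail as retrieved.

    sorted() is stable, so candidates with equal scores keep their first-stage order.
    """
    head = ranked_doc_ids[:k]
    tail = ranked_doc_ids[k:]
    rescored = sorted(head, key=lambda doc_id: score_fn(query, doc_id), reverse=True)
    return rescored + tail


first_stage = ["d2", "d1", "d3"]  # assumed first-stage retrieval order
query = "binary search list"

narrow = rerank_top_k(query, first_stage, toy_overlap_score, k=2)
wide = rerank_top_k(query, first_stage, toy_overlap_score, k=3)
print(narrow)
print(wide)
```

In this toy run the best match, d3, only reaches rank 1 when the reranking window is wide enough to include it, which is one reason the choice of k and the quality of the first-stage retriever both matter when comparing rerankers.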

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Timed benchmark releases could allow repeated testing while tracking contamination over time.
  • Counterfactual rewriting of problems might be used to create cleaner benchmarks for other retrieval or generation tasks.
  • Embedding multitask rerankers into search tools could raise the quality of code suggestions that developers actually encounter.

Load-bearing premise

The counterfactually rewritten problems produce graded relevance judgments that stay free of label noise and match original developer intent without new biases.

What would settle it

If CoREB-Reranker showed no improvement over strong baselines on independently collected real developer code search queries with fresh relevance labels, the claim of consistent gains would be falsified.

Figures

Figures reproduced from arXiv: 2605.04615 by Fan Zhou, Hang Yu, Jin Qin, Siqiao Xue, Yixiang Mu, Zihan Liao, Ziyin Zhang.

Figure 1. Aggregated query distribution across both C…

Figure 2. Benchmark construction pipeline. Each step is detailed in the corresponding paragraph below.

Figure 3. Pass@1 change (pp) after rewriting for Gemini 3 Flash across two releases covering different contest windows.
    Adjacent paper text: Step 2: Counterfactual rewriting and code generation. We apply light counterfactual rewriting (Wu et al., 2024) to each problem's statement and test cases by modifying named entities, variable names, narrative framing, and I/O examples while preserving the formal specification and algorithmic str…

Figure 4. nDCG@k at k ∈ {1, 5, 10} per task, averaged over all eleven models on v202603. Code-to-text saturates early; text-to-code and code-to-code grow more steeply with k.
    Adjacent table:

    Model            Canonical   Full    Search
    GemEmb-2         0.573       0.565   0.000
    C2LLM-7B         0.582       0.578   0.004
    C2LLM-0.5B       0.566       0.560   0.003
    F2LLM-4B         0.529       0.535   0.007
    Jina-code-1.5B   0.539       0.543   0.006
    F2LLM-1.7B       0.481       0.520   0.008
    Qwen3-4B         0.510       0.504   0.015
    Qwen3-0.6B       0.452       …

Figure 6. Text-to-code nDCG@10 by target language on v202603 (excluding Search subtask). Full per-subtask and per-language tables are in Appendix B.6.
    Adjacent paper text: Analysis V: Do hard negatives provide additional evaluation signal? A key design choice in CoREB is the explicit inclusion of same-problem hard negatives (relevance=1) in the qrels. Unlike benchmarks with binary or absent-is-irrelevant qrels, our graded scheme expose…

Figure 7. Hard-negative intrusion rate: fraction of queries where at least one hard negative ranks above the best…

Figure 8. Δ nDCG@10 (%) after reranking (k = 128) on top of C2LLM-7B. No baseline is net-positive across all three tasks; only our fine-tuned CoREB-Reranker achieves this.
    Adjacent paper text: Main results on reranking. We rerank the top-128 candidates retrieved by C2LLM-7B (the strongest open-weight retriever) with four baseline rerankers: Jina Reranker v2, Jina Reranker v3, Qwen3-Reranker-0.6B, and Qwen3-Reranker-4B

Figure 9. Overall nDCG@10 vs. parameter count (log scale) for the ten open-weight models. Circles = code-specialized; squares = general-purpose. The dashed line marks the Pareto frontier. GemEmb-2 is excluded due to its unknown parameter size.
    Adjacent bar chart (axis: overall nDCG@10 / billion parameters). Bars, in order: C2LLM-0.5B, Jina-code-0.5B, Qwen3-0.6B, F2LLM-0.6B, F2LLM-4B, Qwen3-4B, C2LLM-7B, Qwen3-8B. Values, in order: 1.21, 1.19, 0.74, 0.73, 0.14, 0.12, 0.09…

Figure 11. Query distribution per CoREB release: v202602 (2,604 queries, top row) and v202603 (2,483 queries, bottom row). The two releases exhibit nearly identical proportions; see…

Figure 12. Code-to-text nDCG@10 by subtask type on v202603.
    Adjacent paper text: How do models handle short keyword queries? The text-to-code Search subtask uses short developer-style queries (19 tokens on average). On this subtask every model collapses to near-zero nDCG@10, two orders of magnitude below the Canonical subtask (…

Figure 13. Full model parameter efficiency (expanded version of Figures…

Figure 14. Recall@…
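
The hard-negative intrusion rate named in the Figure 7 caption can be illustrated with a short Python sketch. The caption is truncated on this page, so the definition used here is an assumption: a query counts as intruded when at least one hard negative (relevance=1) ranks above the best-ranked true positive (relevance=2), or when a hard negative is retrieved and no true positive is. The function name, parameters, and toy data are hypothetical.

```python
def hard_negative_intrusion_rate(rankings, qrels, positive_grade=2, hard_negative_grade=1):
    """Fraction of queries where a hard negative outranks the best true positive.

    rankings: query id -> list of doc ids, best first.
    qrels: query id -> {doc id: graded relevance}; absent docs count as 0.
    Assumption: if no true positive is retrieved, any retrieved hard negative intrudes.
    """
    if not rankings:
        return 0.0
    intruded = 0
    for query_id, ranked_doc_ids in rankings.items():
        grades = qrels.get(query_id, {})
        positive_ranks = [
            rank for rank, doc_id in enumerate(ranked_doc_ids)
            if grades.get(doc_id, 0) == positive_grade
        ]
        hard_ranks = [
            rank for rank, doc_id in enumerate(ranked_doc_ids)
            if grades.get(doc_id, 0) == hard_negative_grade
        ]
        best_positive = min(positive_ranks) if positive_ranks else len(ranked_doc_ids)
        if hard_ranks and min(hard_ranks) < best_positive:
            intruded += 1
    return intruded / len(rankings)


# Toy data: p* are true positives, h* are same-problem hard negatives, x/y are unjudged.
toy_rankings = {
    "q1": ["p1", "h1", "x"],   # positive first: no intrusion
    "q2": ["h2", "p2", "x"],   # hard negative above the positive: intrusion
    "q3": ["x", "p3", "h3"],   # unjudged doc first, positive above hard negative: no intrusion
    "q4": ["h4", "x", "y"],    # positive never retrieved, hard negative present: intrusion
}
toy_qrels = {
    "q1": {"p1": 2, "h1": 1},
    "q2": {"p2": 2, "h2": 1},
    "q3": {"p3": 2, "h3": 1},
    "q4": {"p4": 2, "h4": 1},
}
rate = hard_negative_intrusion_rate(toy_rankings, toy_qrels)
print(rate)
```

A binary qrels scheme would treat q2 and q4 the same as any other miss; keeping the hard negatives as a separate grade is what makes this kind of diagnostic possible.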
Original abstract

Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce CoREB, a contamination-limited, multitask code retrieval and reranking benchmark, together with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. CoREB is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code. Our experiments reveal that: ① code-specialised embeddings dominate code-to-code retrieval (~2× over general encoders), yet no single model wins all three tasks; ② short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10; ③ off-the-shelf rerankers are task-asymmetric, with a 12-point swing on code-to-code and no baseline net-positive across all tasks; ④ our fine-tuned CoREB-Reranker is the first to achieve consistent gains across all three tasks. The data and model are released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CoREB, a contamination-limited multitask benchmark for code retrieval and reranking constructed from counterfactually rewritten LiveCodeBench problems across five languages with graded relevance judgments. It evaluates 11 embedding models and 5 rerankers on text-to-code, code-to-text, and code-to-code tasks, highlighting asymmetries in model performance (e.g., code-specialized embeddings excelling on code-to-code but no universal winner) and query format effects, while proposing a fine-tuned CoREB-Reranker that achieves the first consistent nDCG@10 gains across all three tasks. Data and model are released.

Significance. If the benchmark construction and graded labels prove robust, this advances code search evaluation by moving beyond first-stage retrieval to full pipelines, addressing data contamination, binary relevance, and label noise in prior benchmarks. The multitask design, empirical findings on short-keyword query collapse and reranker task-asymmetry, plus artifact release, provide a useful foundation for more realistic code search research.

major comments (3)
  1. [§3] Benchmark Construction: The counterfactual rewriting of LiveCodeBench problems to generate graded relevance judgments lacks explicit validation (e.g., human review of intent preservation or semantic equivalence checks). This is load-bearing for all reported nDCG@10 results and the headline claim of consistent reranker gains, as any intent drift or label noise from edits to control flow, variables, or edge cases could artificially inflate the observed 12-point swings.
  2. [§5.3–5.4] Experiments and Results: The claim that CoREB-Reranker is the first to achieve net-positive gains across all three tasks is presented without statistical significance tests, confidence intervals, or multiple-run variance for the nDCG@10 differences. This weakens attribution of improvements to the model rather than benchmark-specific effects.
  3. [Table 3] (or equivalent performance tables): No ablation compares results on original LiveCodeBench problems versus the rewritten versions, making it impossible to isolate whether the multitask gains stem from the reranker or from properties introduced by the rewriting process itself.
minor comments (2)
  1. [Abstract / §1] The abstract and §1 should clarify the exact timing and versioning mechanism for the 'timed releases' to support reproducibility claims.
  2. [§3.2] Notation for graded relevance (e.g., how 0–3 or similar scales map to nDCG computation) could be stated more explicitly in §3.2 to avoid ambiguity in replication.
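
To make the second minor comment concrete, here is a minimal Python sketch of nDCG@k over graded qrels. It assumes the two positive grades quoted elsewhere on this page (relevance=2 for true positives, relevance=1 for same-problem hard negatives, everything else 0). Whether the paper uses the linear gain (rel) or the exponential gain (2^rel - 1) is not stated here, so both are offered behind a flag; the function names and toy data are hypothetical.

```python
import math


def dcg_at_k(relevances, k, exponential_gain=False):
    """Discounted cumulative gain over the first k graded relevance labels."""
    total = 0.0
    for rank, rel in enumerate(relevances[:k], start=1):
        gain = (2 ** rel - 1) if exponential_gain else rel
        total += gain / math.log2(rank + 1)
    return total


def ndcg_at_k(ranked_doc_ids, qrels, k=10, exponential_gain=False):
    """nDCG@k for one query; qrels maps doc id -> grade, absent docs count as 0."""
    observed = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k, exponential_gain)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents judged for this query
    return dcg_at_k(observed, k, exponential_gain) / ideal_dcg


# Toy query: one true positive (grade 2) and two hard negatives (grade 1).
toy_qrels = {"sol_a": 2, "neg_b": 1, "neg_c": 1}
perfect = ndcg_at_k(["sol_a", "neg_b", "neg_c", "other"], toy_qrels)
intruded = ndcg_at_k(["neg_b", "sol_a", "neg_c", "other"], toy_qrels)
print(perfect, intruded)
```

Under the linear gain, swapping the true positive with a hard negative lowers the score but does not zero it, which is the behaviour a graded scheme adds over binary relevance. Replication would still need the paper's exact gain and discount convention.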

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and will incorporate revisions to strengthen the manuscript's claims regarding benchmark validity, statistical rigor, and ablation analysis.

  1. Referee: [§3] Benchmark Construction: The counterfactual rewriting of LiveCodeBench problems to generate graded relevance judgments lacks explicit validation (e.g., human review of intent preservation or semantic equivalence checks). This is load-bearing for all reported nDCG@10 results and the headline claim of consistent reranker gains, as any intent drift or label noise from edits to control flow, variables, or edge cases could artificially inflate the observed 12-point swings.

    Authors: We appreciate this observation on the robustness of our benchmark construction. The rewriting process in §3 followed a structured protocol to preserve core problem intent, functionality, and semantics while introducing controlled counterfactual variations (e.g., variable renaming and control-flow adjustments that do not alter expected outputs). However, we acknowledge that a formal human validation study was not reported in the initial submission. In the revised manuscript, we will expand §3 with the full rewriting guidelines and include results from a targeted human evaluation (on a sample of problems across languages) confirming intent preservation and low label noise. This directly addresses concerns about potential artifacts in the nDCG@10 gains. revision: yes

  2. Referee: [§5.3–5.4] Experiments and Results: The claim that CoREB-Reranker is the first to achieve net-positive gains across all three tasks is presented without statistical significance tests, confidence intervals, or multiple-run variance for the nDCG@10 differences. This weakens attribution of improvements to the model rather than benchmark-specific effects.

    Authors: We agree that the absence of statistical analysis limits the strength of our attribution claims. The original experiments used single-run evaluations on the fixed CoREB splits. In the revised version, we will add bootstrap confidence intervals (with 1000 resamples) for all nDCG@10 scores and apply paired non-parametric tests (Wilcoxon signed-rank) to evaluate whether CoREB-Reranker's improvements over baselines are statistically significant across the three tasks. This will better isolate model contributions from benchmark-specific variance. revision: yes

  3. Referee: [Table 3] (or equivalent performance tables): No ablation compares results on original LiveCodeBench problems versus the rewritten versions, making it impossible to isolate whether the multitask gains stem from the reranker or from properties introduced by the rewriting process itself.

    Authors: This is a valid methodological concern. While CoREB is designed as a contamination-limited benchmark with timed releases, we did not include a direct comparison to the original LiveCodeBench problems in the submitted manuscript. We will add this ablation in the revised paper: we will evaluate the same embedding models and rerankers on the subset of original LiveCodeBench problems that overlap with our rewritten set (where feasible) and report the delta in nDCG@10. This will help clarify the contribution of the rewriting process versus the reranker itself. revision: yes
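
The bootstrap analysis promised in the second response could look roughly like the following Python sketch. It is an illustration under assumptions, not the authors' procedure: it resamples per-query nDCG@10 differences with replacement (a paired percentile bootstrap), uses 1000 resamples as stated in the response, and all names and the toy scores are hypothetical. The Wilcoxon signed-rank test mentioned alongside it is not shown.

```python
import random


def paired_bootstrap_ci(baseline, reranked, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-query score difference.

    baseline, reranked: per-query scores for the same queries, in the same order.
    Returns (mean_delta, ci_low, ci_high).
    """
    if len(baseline) != len(reranked) or not baseline:
        raise ValueError("need two non-empty, equally long score lists")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    deltas = [r - b for b, r in zip(baseline, reranked)]
    n = len(deltas)
    resampled_means = []
    for _ in range(n_resamples):
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    low_index = int(n_resamples * alpha / 2)
    high_index = n_resamples - 1 - low_index
    return sum(deltas) / n, resampled_means[low_index], resampled_means[high_index]


# Toy per-query nDCG@10 values; these numbers are made up for illustration.
baseline_ndcg = [0.40, 0.55, 0.62, 0.30, 0.71, 0.48, 0.52, 0.66]
reranked_ndcg = [0.45, 0.60, 0.60, 0.38, 0.75, 0.50, 0.59, 0.70]

mean_delta, ci_low, ci_high = paired_bootstrap_ci(baseline_ndcg, reranked_ndcg)
print(mean_delta, ci_low, ci_high)
```

If the interval for a task excludes zero, the gain on that task is unlikely to be a sampling artifact of the query set; running this once per task would show whether the "consistent gains" claim holds for all three.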

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark and model evaluation on released data


The paper constructs CoREB from counterfactually rewritten LiveCodeBench problems, releases the data and fine-tuned CoREB-Reranker, then reports empirical nDCG@10 results across three tasks for eleven embeddings and five rerankers. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. The central claim of consistent gains is an empirical observation on the new benchmark rather than a reduction to prior inputs by construction. The study is self-contained against external benchmarks and released artifacts.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are required beyond standard assumptions in machine learning benchmarking and fine-tuning.

pith-pipeline@v0.9.0 · 5586 in / 1072 out tokens · 34325 ms · 2026-05-11T01:21:39.301354+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
    Relation between the paper passage and the cited Recognition theorem: unclear

    Our experiments reveal that: ① code-specialised embeddings dominate code-to-code retrieval (~2× over general encoders), yet no single model wins all three tasks; ② short keyword queries... collapse every model to near-zero nDCG@10; ③ off-the-shelf rerankers are task-asymmetric...; ④ our fine-tuned CoREB-Reranker is the first to achieve consistent gains across all three tasks.

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective
    Relation between the paper passage and the cited Recognition theorem: unclear

    CoREB is built from counterfactually rewritten LiveCodeBench problems... with graded relevance judgments... relevance=2 to true positives, relevance=1 to same-problem hard negatives

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
