pith. sign in

arxiv: 2605.18561 · v1 · pith:ZKY56IMOnew · submitted 2026-05-18 · 💻 cs.IR · cs.AI· cs.SE

Improving BM25 Code Retrieval Under Fixed Generic Tokenization: Adaptive q-Log Odds as a Drop-In BM25 Fix

Pith reviewed 2026-05-20 08:07 UTC · model grok-4.3

classification 💻 cs.IR cs.AIcs.SE
keywords BM25code retrievalq-logarithmRSJ oddsinformation retrievaltokenizationNDCGsparse retrieval
0
0 comments X

The pith

Replacing the logarithm in BM25's RSJ odds with a tunable q-logarithm lifts code retrieval NDCG@10 by 89 percent under fixed generic tokenization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard BM25 underperforms in code search when the tokenizer is frozen and generic because its logarithmic term weighting fails to separate distinctive rare identifiers across functions. The fix replaces the outer log in the Robertson-Spärck-Jones odds with a q-logarithm that reduces to BM25 at q equals 1 and stretches the tail for q less than 1. On a 182-thousand-document Go code corpus the change raises oracle-tuned NDCG@10 from 0.2575 to 0.4874 with no sign reversals across bootstrap trials. The gain shrinks on other code languages and vanishes on ordinary text collections. A single-pass corpus statistic based on hapax density selects q without per-query tuning and adds no query-time cost.

Core claim

The central claim is that a one-parameter q-log transform of the Robertson-Spärck-Jones odds, inserted in place of the natural logarithm inside BM25, corrects under-separation of identifier tails when tokenization cannot be changed. At q=1 the formula recovers classical BM25 by L'Hôpital's rule; for q less than 1 it functions as a Box-Cox reweighting of the odds. The resulting scorer is a drop-in replacement that requires only a single pass over the sparse index at build time and leaves query latency unchanged.

What carries the argument

The q-logarithm of the Robertson-Spärck-Jones odds, which recovers BM25 exactly at q=1 and stretches rare-term separation for q less than 1.

If this is right

  • NDCG@10 rises by 89 percent relative on CoIR CodeSearchNet Go with zero sign reversals in 10,000 paired bootstrap resamples.
  • The lift is graded across code languages and near zero on BEIR text collections.
  • A closed-form estimate of q from hapax density keeps the method parameter-light and near q=1 on text where BM25 already works well.
  • Identifier-aware tokenization removes most of the incremental gain, confirming the method targets tokenizer limitations.
  • Index build adds one pass over the sparse matrix while query latency stays identical to BM25.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reweighting could be tested on other sparse lexical retrievers that rely on logarithmic IDF.
  • Domains with long-tailed rare distinguishing terms, such as legal or biomedical search, may show analogous gains under fixed tokenizers.
  • The approach suggests that the natural logarithm is not universally optimal for term weighting and invites similar Box-Cox-style adjustments in other ranking formulas.
  • Because the method is a post-hoc score transform, it can be combined with existing BM25 indexes without re-indexing.

Load-bearing premise

BM25's logarithmic RSJ-odds under-separates the identifier tail that distinguishes one function from another when tokenization is generic and frozen.

What would settle it

Running the q-tuned scorer against a code corpus with the same frozen generic tokenizer and finding zero or negative change in NDCG@10 or MAP would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.18561 by Oktay Goktas, Santosh Kumar Radha.

Figure 1
Figure 1. Figure 1: The q-log deformation of RSJ odds. Each curve is idfRSJ q (t) = lnq((N − nt + 0.5)/(nt + 0.5)) with lnq(x) = (x 1−q − 1)/(1 − q), plotted against the document-frequency fraction nt/N. At q=1 (dashed gray) the expression recovers the classical RSJ log-IDF exactly via L’Hôpital. For q < 1 the IDF amplifies ultra-rare tokens into a power-law tail x 1−q/(1 − q); for q > 1 it saturates at 1/(q − 1) and the rare… view at source ↗
Figure 2
Figure 2. Figure 2: q-log sweep on CoIR-CSN Go under the q-log RSJ IDF, idfRSJ q (t) = lnq [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: A qualitative walkthrough of the mechanism on a CoIR-Go full-corpus query. Under [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Per-query mechanism of q-log BM25 on CoIR-CodeSearchNet at each corpus’s working point: Go full 182K at q=0.10 (blue, N=8,122 queries; mean ∆NDCG@10 = +0.23) and Python full 280K at q=1.00 (orange, N=14,918 queries; ∆ ≡ 0 by construction — q-log BM25 is identi￾cal to BM25 at q=1, which we use as a null control). Panels (a)–(c) scatter per-query ∆NDCG@10 against three features of the query: low-df mass (fra… view at source ↗
Figure 5
Figure 5. Figure 5: The q-predictor, two views. (a) Leave-one-language-out: for each of the six code lan￾guages, the form q = 1 − c htok is fit on the other five languages at all three corpus sizes and applied to the held-out language’s three sizes; grand-mean recovery 0.72. Shaded band is ±1.96σ from the eighteen LOLO residuals (σ = 0.183). (b) Final fit on all eighteen points, qpred = 1 − 7.28 htok (clipped to [0.01, 1]), w… view at source ↗
Figure 6
Figure 6. Figure 6: Adaptive q-log gate on the CoIR-CSN Python scaling curve. BM25 (q=1) loses NDCG@10 monotonically as the corpus grows from 1k to 280k docs. The naive static pick q=0.7 (argmax on the 5k subset sweep) helps at small sizes but collapses to −4.6% at full scale where the oracle argmax has moved back to q=1. The deployed one-parameter adaptive predictor qpred = 1 − 7.28 htok of eq. (4) tracks the oracle: it beat… view at source ↗
Figure 7
Figure 7. Figure 7: Tokenizer ablation across CoIR CSN languages (Go, Python, Java, Ruby; 50K docs per [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: A practitioner decision chart. When the analyzer is under the operator’s control, the first [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Recall@K-tokens on CoIR CodeSearchNet full-corpus splits under frozen generic tok￾enization. Each language is ranked with BM25 (gray) and the q-log BM25 method (blue) at the shown operating point (Go uses the conservative q=0.10 setting); we walk the ranked list top-down and mark a query as recalled at budget K iff the cumulative tiktoken (cl100k_base) count through the gold document is ≤ K. Note the diffe… view at source ↗
Figure 10
Figure 10. Figure 10: RepoBench-R NDCG@10 on the four “{Python, Java} [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Systems overhead on CoIR-Go (182,440 docs, V=144,938). q-log BM25 adds a single O(|V | + nnz) sparse-matrix pass at index time (median 117.7 ms, MAD 1.4 ms over 5 trials). Index-build wall clock is within +7.06% of BM25. Query latency (top-100, 1000 queries) is within measurement noise (p50 ∆ = +0.18%, p95 ∆ = +2.29%). q-log BM25 and BM25 share the exact same CSC scoring code path at query time—the only c… view at source ↗
Figure 12
Figure 12. Figure 12: Distribution of mean-test recovery across all [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗
read the original abstract

In retrieval-augmented coding, failures often begin when the relevant file is absent from the retrieved context. Under frozen generic tokenization, where a BM25 index has been built by a search system whose analyzer the practitioner does not control, this failure is routine: BM25's logarithmic RSJ-odds IDF under-separates the identifier tail that distinguishes one function from another. We replace the outer logarithm of the Robertson-Sp\"arck-Jones odds with a q-logarithm. At q=1 the transform recovers BM25 exactly by L'H\^opital's rule, and for q<1 it is a Box-Cox transform of the RSJ odds with lambda = 1-q. On CoIR CodeSearchNet Go (182K documents), oracle-tuned NDCG@10 rises from 0.2575 to 0.4874 (absolute +0.2299; +89.3% relative; zero sign reversals in 10,000 paired-bootstrap resamples, reported as p <= 10^-4). The effect is graded across code languages and is near-zero on BEIR text. A one-parameter closed form estimates a corpus-level q from hapax density and stays near q=1 on corpora where BM25 is already optimal. The index-time cost is a single pass over the sparse score matrix and query latency is unchanged. A tokenizer ablation shows that identifier-aware tokenization largely removes the incremental gain from q-IDF.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes replacing the outer logarithm in BM25's Robertson-Spärck-Jones odds IDF with a q-logarithm (recovering standard BM25 at q=1 via L'Hôpital's rule) as a drop-in fix for code retrieval under frozen generic tokenization. On CoIR CodeSearchNet Go (182K docs), oracle-tuned NDCG@10 improves from 0.2575 to 0.4874 (+89.3% relative) with zero sign reversals in 10k paired bootstrap resamples (p ≤ 10^{-4}); a closed-form hapax-density estimator for corpus-level q is introduced, with near-zero gains on BEIR text and ablation evidence that identifier-aware tokenization removes the benefit.

Significance. If the hapax-density estimator yields comparable gains without labels or oracle tuning, the method would be a low-cost, latency-neutral improvement to BM25 for code search under fixed tokenizers, supported by the bootstrap validation and graded language effects. The parameter-free grounding of q via hapax density and the tokenizer ablation are strengths that tie the result to the stated premise about identifier-tail separation.

major comments (2)
  1. [Abstract and experimental results] Abstract and results: the reported NDCG@10 gain of +0.2299 (0.2575 → 0.4874) and associated bootstrap significance apply only to oracle-tuned q that maximizes the test metric; the magnitude of improvement (or lack thereof) achieved by the closed-form hapax-density estimator is not stated, which is load-bearing for the central claim of a practical drop-in BM25 fix under fixed generic tokenization.
  2. [Tokenizer ablation] Tokenizer ablation: while the ablation shows identifier-aware tokenization largely removes the q-IDF gain, the section does not report the interaction term or per-language delta between generic and identifier-aware settings with the estimator q, leaving the premise about under-separation of the identifier tail only partially quantified.
minor comments (2)
  1. [Methods] Clarify in the methods whether the q-log is applied only to the IDF term or also to the TF component, and confirm that query latency remains exactly unchanged after the single-pass index adjustment.
  2. [Results tables] Tables reporting cross-language results should include both oracle q and the hapax-estimated q side-by-side to allow direct assessment of the estimator's effectiveness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each of the major comments point by point below, indicating where revisions will be made to the manuscript.

read point-by-point responses
  1. Referee: [Abstract and experimental results] Abstract and results: the reported NDCG@10 gain of +0.2299 (0.2575 → 0.4874) and associated bootstrap significance apply only to oracle-tuned q that maximizes the test metric; the magnitude of improvement (or lack thereof) achieved by the closed-form hapax-density estimator is not stated, which is load-bearing for the central claim of a practical drop-in BM25 fix under fixed generic tokenization.

    Authors: We agree that the performance of the closed-form hapax-density estimator on the primary evaluation task is central to the practical claim and should be reported explicitly. The manuscript introduces the estimator and notes that it stays near q=1 on corpora where BM25 is optimal, but does not provide the corresponding NDCG@10 value for CodeSearchNet Go. We will revise the experimental results section to include the NDCG@10 achieved using the estimated q, along with bootstrap significance if applicable, to allow direct comparison with the oracle-tuned results. revision: yes

  2. Referee: [Tokenizer ablation] Tokenizer ablation: while the ablation shows identifier-aware tokenization largely removes the q-IDF gain, the section does not report the interaction term or per-language delta between generic and identifier-aware settings with the estimator q, leaving the premise about under-separation of the identifier tail only partially quantified.

    Authors: We concur that quantifying the interaction between tokenization type and the use of the estimator q, as well as per-language deltas, would provide a more complete picture of the effect. We will update the tokenizer ablation section to include these metrics for the estimator q in addition to the oracle q, thereby more fully supporting the premise regarding identifier tail separation under generic tokenization. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper replaces the outer logarithm in BM25's RSJ-odds with a q-logarithm, recovering standard BM25 exactly at q=1 via L'Hôpital's rule and framing q<1 as a Box-Cox transform. This is a direct mathematical generalization with no definitional loop. The adaptive q is supplied by an explicit closed-form estimator computed from hapax density (a corpus-level statistic independent of any retrieval metric or label). Oracle-tuned NDCG numbers are presented as an upper-bound demonstration rather than a claimed prediction; the estimator itself is described as the practical drop-in mechanism and is stated to remain near 1 where BM25 already performs well. No self-citations appear in the load-bearing steps, no uniqueness theorems are imported from prior author work, and no ansatz is smuggled via citation. The central empirical contrast (large gains on code, near-zero on BEIR text) is externally falsifiable and does not reduce to the inputs by construction. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the mathematical validity of the q-log transform as a generalization of the logarithm and the empirical observation that hapax density can be used to estimate an effective q for code corpora.

free parameters (1)
  • q
    The parameter in the q-logarithm, estimated via closed form from hapax density rather than directly fitted to the retrieval performance metric.
axioms (1)
  • standard math The q-logarithm recovers the standard logarithm at q=1 via L'Hôpital's rule
    Invoked to show that q=1 is exactly BM25.

pith-pipeline@v0.9.0 · 5807 in / 1517 out tokens · 60564 ms · 2026-05-20T08:07:36.739877+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 3 internal anchors

  1. [1]

    doi: 10.1145/582415.582416. Egor Bogomolov, Aleksandra Eliseeva, Timur Galimzyanov, Evgeniy Glukhov, Anton Shapkin, Maria Tigina, Yaroslav Golubev, Alexander Kovrigin, Arie van Deursen, Maliheh Izadi, and Tim- ofey Bryksin. Long code arena: a set of benchmarks for long-context code models. arXiv preprint arXiv:2406.11612,

  2. [2]

    doi: 10.1111/j.2517-6161.1964.tb00553.x. B. Caprile and P. Tonella. Restructuring program identifier names. InProceedings of the IEEE International Conference on Software Maintenance (ICSM 2000), pages 97–107,

  3. [3]

    Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Kr- ishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang

    1109/ICSM.2000.883022. Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Kr- ishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang. Cross- CodeEval: A diverse and multilingual benchmark for cross-file code completion. InAdvances in Neural Information Processing Systems 36 (NeurIPS 2023),

  4. [4]

    doi: 10.48550/arXiv.2310. 11248. URLhttps://openreview.net/forum?id=wgDcbBMSfh. Eric Enslen, Emily Hill, Lori Pollock, and K. Vijay-Shanker. Mining source code to automatically split identifiers for software analysis. In6th IEEE International Working Conference on Mining Software Repositories (MSR 2009), pages 71–80,

  5. [5]

    Emily Hill, Dawn Lawrie, Lori Pollock, and K

    doi: 10.1109/MSR.2009.5069482. Emily Hill, Dawn Lawrie, Lori Pollock, and K. Vijay-Shanker. An empirical study of identifier splitting techniques.Empirical Software Engineering, 19(6):1754–1780,

  6. [6]

    CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

    doi: 10.48550/arXiv.1909.09436. Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques.ACM Transactions on Information Systems, 20(4):422–446,

  7. [7]

    16 Karen Spärck Jones

    doi: 10.1145/582415.582418. 16 Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21,

  8. [8]

    Chris Kamphuis, Arjen P

    doi: 10.1108/eb026526. Chris Kamphuis, Arjen P. de Vries, Leonid Boytsov, and Jimmy Lin. Which BM25 do you mean? a large-scale reproducibility study of scoring variants. InAdvances in Information Retrieval, volume 12035 ofLecture Notes in Computer Science, pages 28–34. Springer,

  9. [9]

    Tianyang Liu, Canwen Xu, and Julian McAuley

    doi: 10.48550/arXiv.2407.02883. Tianyang Liu, Canwen Xu, and Julian McAuley. RepoBench: Benchmarking repository-level code auto-completion systems. InThe Twelfth International Conference on Learning Representations (ICLR 2024),

  10. [10]

    RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

    doi: 10.48550/arXiv.2306.03091. Panagiotis Louridas, Diomidis Spinellis, and Vasileios Vlachos. Power laws in software.ACM Transactions on Software Engineering and Methodology, 18(1):1–26,

  11. [11]

    Xing Han Lu

    doi: 10.1145/ 1391984.1391986. Xing Han Lu. BM25S: Orders of magnitude faster lexical search via eager sparse scoring.arXiv preprint,

  12. [12]

    Bm25s: Orders of magnitude faster lexical search via eager sparse scoring, 2024

    doi: 10.48550/arXiv.2407.03618. Santosh Kumar Radha and Oktay Goktas. RareCode: Code and artifacts for adaptiveq-log bm25 code retrieval.https://github.com/santoshkumarradha/rarecode,

  13. [13]

    Robertson and Hugo Zaragoza , title =

    doi: 10.1561/1500000019. Stephen E. Robertson, Steve Walker, Susan Jones, Micheline M. Hancock-Beaulieu, and Mike Gat- ford. Okapi at TREC-3. InProceedings of The Third Text REtrieval Conference, TREC 1994, Gaithersburg, Maryland, USA, November 2-4, 1994, number 500-225 in NIST Special Pub- lication, pages 109–126. National Institute of Standards and Tech...

  14. [14]

    URL https://trec.nist.gov/pubs/trec3/papers/city.ps.gz. Mark D. Smucker, James Allan, and Ben Carterette. A comparison of statistical significance tests for information retrieval evaluation. InProceedings of the Sixteenth ACM Conference on Infor- mation and Knowledge Management (CIKM 2007), pages 623–632,

  15. [15]

    doi: 10.1145/1321440. 1321528. Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. InProceed- ings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS 2021),

  16. [16]

    BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

    doi: 10.48550/arXiv.2104.08663. Constantino Tsallis. Possible generalization of Boltzmann-Gibbs statistics.Journal of Statistical Physics, 52(1–2):479–487,

  17. [17]

    Possible Generalization of Boltzmann-Gibbs Statistics

    doi: 10.1007/BF01016429. Constantino Tsallis.Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World. Springer, New York,

  18. [18]

    Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World

    doi: 10.1007/978-0-387-85359-8. Hongyu Zhang. Discovering power laws in computer programs.Information Processing & Man- agement, 45(4):477–483,

  19. [19]

    A Systems Overhead Theq-log method rescales the per-term IDF vector once at index-load time and leaves the query path unchanged

    doi: 10.1016/j.ipm.2009.02.002. A Systems Overhead Theq-log method rescales the per-term IDF vector once at index-load time and leaves the query path unchanged. We measure the end-to-end overhead on CoIR-Go (182,440 documents,V= 144,938 vocabulary tokens) across5trials of1,000queries with top-100retrieval; values reported as median ±MAD withgc.disable(). ...