Improving BM25 Code Retrieval Under Fixed Generic Tokenization: Adaptive q-Log Odds as a Drop-In BM25 Fix
Pith reviewed 2026-05-20 08:07 UTC · model grok-4.3
The pith
Replacing the logarithm in BM25's RSJ odds with a tunable q-logarithm lifts code retrieval NDCG@10 by 89 percent under fixed generic tokenization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a one-parameter q-log transform of the Robertson-Spärck-Jones odds, inserted in place of the natural logarithm inside BM25, corrects under-separation of identifier tails when tokenization cannot be changed. At q=1 the formula recovers classical BM25 by L'Hôpital's rule; for q less than 1 it functions as a Box-Cox reweighting of the odds. The resulting scorer is a drop-in replacement that requires only a single pass over the sparse index at build time and leaves query latency unchanged.
What carries the argument
The q-logarithm of the Robertson-Spärck-Jones odds, which recovers BM25 exactly at q=1 and stretches rare-term separation for q less than 1.
If this is right
- NDCG@10 rises by 89 percent relative on CoIR CodeSearchNet Go with zero sign reversals in 10,000 paired bootstrap resamples.
- The lift is graded across code languages and near zero on BEIR text collections.
- A closed-form estimate of q from hapax density keeps the method parameter-light and near q=1 on text where BM25 already works well.
- Identifier-aware tokenization removes most of the incremental gain, confirming the method targets tokenizer limitations.
- Index build adds one pass over the sparse matrix while query latency stays identical to BM25.
Where Pith is reading between the lines
- The same reweighting could be tested on other sparse lexical retrievers that rely on logarithmic IDF.
- Domains with long-tailed rare distinguishing terms, such as legal or biomedical search, may show analogous gains under fixed tokenizers.
- The approach suggests that the natural logarithm is not universally optimal for term weighting and invites similar Box-Cox-style adjustments in other ranking formulas.
- Because the method is a post-hoc score transform, it can be combined with existing BM25 indexes without re-indexing.
Load-bearing premise
BM25's logarithmic RSJ-odds under-separates the identifier tail that distinguishes one function from another when tokenization is generic and frozen.
What would settle it
Running the q-tuned scorer against a code corpus with the same frozen generic tokenizer and finding zero or negative change in NDCG@10 or MAP would falsify the central claim.
Figures
read the original abstract
In retrieval-augmented coding, failures often begin when the relevant file is absent from the retrieved context. Under frozen generic tokenization, where a BM25 index has been built by a search system whose analyzer the practitioner does not control, this failure is routine: BM25's logarithmic RSJ-odds IDF under-separates the identifier tail that distinguishes one function from another. We replace the outer logarithm of the Robertson-Sp\"arck-Jones odds with a q-logarithm. At q=1 the transform recovers BM25 exactly by L'H\^opital's rule, and for q<1 it is a Box-Cox transform of the RSJ odds with lambda = 1-q. On CoIR CodeSearchNet Go (182K documents), oracle-tuned NDCG@10 rises from 0.2575 to 0.4874 (absolute +0.2299; +89.3% relative; zero sign reversals in 10,000 paired-bootstrap resamples, reported as p <= 10^-4). The effect is graded across code languages and is near-zero on BEIR text. A one-parameter closed form estimates a corpus-level q from hapax density and stays near q=1 on corpora where BM25 is already optimal. The index-time cost is a single pass over the sparse score matrix and query latency is unchanged. A tokenizer ablation shows that identifier-aware tokenization largely removes the incremental gain from q-IDF.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes replacing the outer logarithm in BM25's Robertson-Spärck-Jones odds IDF with a q-logarithm (recovering standard BM25 at q=1 via L'Hôpital's rule) as a drop-in fix for code retrieval under frozen generic tokenization. On CoIR CodeSearchNet Go (182K docs), oracle-tuned NDCG@10 improves from 0.2575 to 0.4874 (+89.3% relative) with zero sign reversals in 10k paired bootstrap resamples (p ≤ 10^{-4}); a closed-form hapax-density estimator for corpus-level q is introduced, with near-zero gains on BEIR text and ablation evidence that identifier-aware tokenization removes the benefit.
Significance. If the hapax-density estimator yields comparable gains without labels or oracle tuning, the method would be a low-cost, latency-neutral improvement to BM25 for code search under fixed tokenizers, supported by the bootstrap validation and graded language effects. The parameter-free grounding of q via hapax density and the tokenizer ablation are strengths that tie the result to the stated premise about identifier-tail separation.
major comments (2)
- [Abstract and experimental results] Abstract and results: the reported NDCG@10 gain of +0.2299 (0.2575 → 0.4874) and associated bootstrap significance apply only to oracle-tuned q that maximizes the test metric; the magnitude of improvement (or lack thereof) achieved by the closed-form hapax-density estimator is not stated, which is load-bearing for the central claim of a practical drop-in BM25 fix under fixed generic tokenization.
- [Tokenizer ablation] Tokenizer ablation: while the ablation shows identifier-aware tokenization largely removes the q-IDF gain, the section does not report the interaction term or per-language delta between generic and identifier-aware settings with the estimator q, leaving the premise about under-separation of the identifier tail only partially quantified.
minor comments (2)
- [Methods] Clarify in the methods whether the q-log is applied only to the IDF term or also to the TF component, and confirm that query latency remains exactly unchanged after the single-pass index adjustment.
- [Results tables] Tables reporting cross-language results should include both oracle q and the hapax-estimated q side-by-side to allow direct assessment of the estimator's effectiveness.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each of the major comments point by point below, indicating where revisions will be made to the manuscript.
read point-by-point responses
-
Referee: [Abstract and experimental results] Abstract and results: the reported NDCG@10 gain of +0.2299 (0.2575 → 0.4874) and associated bootstrap significance apply only to oracle-tuned q that maximizes the test metric; the magnitude of improvement (or lack thereof) achieved by the closed-form hapax-density estimator is not stated, which is load-bearing for the central claim of a practical drop-in BM25 fix under fixed generic tokenization.
Authors: We agree that the performance of the closed-form hapax-density estimator on the primary evaluation task is central to the practical claim and should be reported explicitly. The manuscript introduces the estimator and notes that it stays near q=1 on corpora where BM25 is optimal, but does not provide the corresponding NDCG@10 value for CodeSearchNet Go. We will revise the experimental results section to include the NDCG@10 achieved using the estimated q, along with bootstrap significance if applicable, to allow direct comparison with the oracle-tuned results. revision: yes
-
Referee: [Tokenizer ablation] Tokenizer ablation: while the ablation shows identifier-aware tokenization largely removes the q-IDF gain, the section does not report the interaction term or per-language delta between generic and identifier-aware settings with the estimator q, leaving the premise about under-separation of the identifier tail only partially quantified.
Authors: We concur that quantifying the interaction between tokenization type and the use of the estimator q, as well as per-language deltas, would provide a more complete picture of the effect. We will update the tokenizer ablation section to include these metrics for the estimator q in addition to the oracle q, thereby more fully supporting the premise regarding identifier tail separation under generic tokenization. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper replaces the outer logarithm in BM25's RSJ-odds with a q-logarithm, recovering standard BM25 exactly at q=1 via L'Hôpital's rule and framing q<1 as a Box-Cox transform. This is a direct mathematical generalization with no definitional loop. The adaptive q is supplied by an explicit closed-form estimator computed from hapax density (a corpus-level statistic independent of any retrieval metric or label). Oracle-tuned NDCG numbers are presented as an upper-bound demonstration rather than a claimed prediction; the estimator itself is described as the practical drop-in mechanism and is stated to remain near 1 where BM25 already performs well. No self-citations appear in the load-bearing steps, no uniqueness theorems are imported from prior author work, and no ansatz is smuggled via citation. The central empirical contrast (large gains on code, near-zero on BEIR text) is externally falsifiable and does not reduce to the inputs by construction. The derivation is therefore self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- q
axioms (1)
- standard math The q-logarithm recovers the standard logarithm at q=1 via L'Hôpital's rule
Reference graph
Works this paper leans on
-
[1]
doi: 10.1145/582415.582416. Egor Bogomolov, Aleksandra Eliseeva, Timur Galimzyanov, Evgeniy Glukhov, Anton Shapkin, Maria Tigina, Yaroslav Golubev, Alexander Kovrigin, Arie van Deursen, Maliheh Izadi, and Tim- ofey Bryksin. Long code arena: a set of benchmarks for long-context code models. arXiv preprint arXiv:2406.11612,
-
[2]
doi: 10.1111/j.2517-6161.1964.tb00553.x. B. Caprile and P. Tonella. Restructuring program identifier names. InProceedings of the IEEE International Conference on Software Maintenance (ICSM 2000), pages 97–107,
-
[3]
1109/ICSM.2000.883022. Yangruibo Ding, Zijian Wang, Wasi Uddin Ahmad, Hantian Ding, Ming Tan, Nihal Jain, Murali Kr- ishna Ramanathan, Ramesh Nallapati, Parminder Bhatia, Dan Roth, and Bing Xiang. Cross- CodeEval: A diverse and multilingual benchmark for cross-file code completion. InAdvances in Neural Information Processing Systems 36 (NeurIPS 2023),
-
[4]
doi: 10.48550/arXiv.2310. 11248. URLhttps://openreview.net/forum?id=wgDcbBMSfh. Eric Enslen, Emily Hill, Lori Pollock, and K. Vijay-Shanker. Mining source code to automatically split identifiers for software analysis. In6th IEEE International Working Conference on Mining Software Repositories (MSR 2009), pages 71–80,
-
[5]
Emily Hill, Dawn Lawrie, Lori Pollock, and K
doi: 10.1109/MSR.2009.5069482. Emily Hill, Dawn Lawrie, Lori Pollock, and K. Vijay-Shanker. An empirical study of identifier splitting techniques.Empirical Software Engineering, 19(6):1754–1780,
-
[6]
CodeSearchNet Challenge: Evaluating the State of Semantic Code Search
doi: 10.48550/arXiv.1909.09436. Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques.ACM Transactions on Information Systems, 20(4):422–446,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1909.09436 1909
-
[7]
doi: 10.1145/582415.582418. 16 Karen Spärck Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1):11–21,
-
[8]
doi: 10.1108/eb026526. Chris Kamphuis, Arjen P. de Vries, Leonid Boytsov, and Jimmy Lin. Which BM25 do you mean? a large-scale reproducibility study of scoring variants. InAdvances in Information Retrieval, volume 12035 ofLecture Notes in Computer Science, pages 28–34. Springer,
-
[9]
Tianyang Liu, Canwen Xu, and Julian McAuley
doi: 10.48550/arXiv.2407.02883. Tianyang Liu, Canwen Xu, and Julian McAuley. RepoBench: Benchmarking repository-level code auto-completion systems. InThe Twelfth International Conference on Learning Representations (ICLR 2024),
-
[10]
RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems
doi: 10.48550/arXiv.2306.03091. Panagiotis Louridas, Diomidis Spinellis, and Vasileios Vlachos. Power laws in software.ACM Transactions on Software Engineering and Methodology, 18(1):1–26,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.03091
-
[11]
doi: 10.1145/ 1391984.1391986. Xing Han Lu. BM25S: Orders of magnitude faster lexical search via eager sparse scoring.arXiv preprint,
-
[12]
Bm25s: Orders of magnitude faster lexical search via eager sparse scoring, 2024
doi: 10.48550/arXiv.2407.03618. Santosh Kumar Radha and Oktay Goktas. RareCode: Code and artifacts for adaptiveq-log bm25 code retrieval.https://github.com/santoshkumarradha/rarecode,
-
[13]
Robertson and Hugo Zaragoza , title =
doi: 10.1561/1500000019. Stephen E. Robertson, Steve Walker, Susan Jones, Micheline M. Hancock-Beaulieu, and Mike Gat- ford. Okapi at TREC-3. InProceedings of The Third Text REtrieval Conference, TREC 1994, Gaithersburg, Maryland, USA, November 2-4, 1994, number 500-225 in NIST Special Pub- lication, pages 109–126. National Institute of Standards and Tech...
-
[14]
URL https://trec.nist.gov/pubs/trec3/papers/city.ps.gz. Mark D. Smucker, James Allan, and Ben Carterette. A comparison of statistical significance tests for information retrieval evaluation. InProceedings of the Sixteenth ACM Conference on Infor- mation and Knowledge Management (CIKM 2007), pages 623–632,
work page 2007
-
[15]
doi: 10.1145/1321440. 1321528. Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. InProceed- ings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS 2021),
-
[16]
BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models
doi: 10.48550/arXiv.2104.08663. Constantino Tsallis. Possible generalization of Boltzmann-Gibbs statistics.Journal of Statistical Physics, 52(1–2):479–487,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2104.08663
-
[17]
Possible Generalization of Boltzmann-Gibbs Statistics
doi: 10.1007/BF01016429. Constantino Tsallis.Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World. Springer, New York,
-
[18]
Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World
doi: 10.1007/978-0-387-85359-8. Hongyu Zhang. Discovering power laws in computer programs.Information Processing & Man- agement, 45(4):477–483,
-
[19]
doi: 10.1016/j.ipm.2009.02.002. A Systems Overhead Theq-log method rescales the per-term IDF vector once at index-load time and leaves the query path unchanged. We measure the end-to-end overhead on CoIR-Go (182,440 documents,V= 144,938 vocabulary tokens) across5trials of1,000queries with top-100retrieval; values reported as median ±MAD withgc.disable(). ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.