Improving Search Suggestions for Alphanumeric Queries

Diptesh Kanojia; Jayanth Yetukuri; Qunzhi Zhou; Samarth Agrawal; Zhe Wu

arxiv: 2604.07364 · v1 · submitted 2026-04-01 · 💻 cs.IR

Improving Search Suggestions for Alphanumeric Queries

Samarth Agrawal , Jayanth Yetukuri , Diptesh Kanojia , Qunzhi Zhou , Zhe Wu This is my paper

Pith reviewed 2026-05-13 21:48 UTC · model grok-4.3

classification 💻 cs.IR

keywords alphanumeric searchbinary encodingHamming distancesearch suggestionscharacter level retrievale-commerce catalogstraining free methodidentifier matching

0 comments

The pith

Encoding alphanumeric sequences as fixed-length binary vectors allows efficient similarity search using Hamming distance for product suggestions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to solve the problem of searching alphanumeric identifiers like part numbers that are sensitive to small changes and don't work well with standard search techniques. It does this by turning each identifier into a binary vector based on its characters, then measuring similarity with Hamming distance for quick lookups. An edit distance step can be added to improve accuracy. This matters because it provides a simple, no-training-required way to build effective search suggestion systems for e-commerce catalogs where these codes are common.

Core claim

The authors claim that a fixed-length binary vector representation of alphanumeric sequences, combined with Hamming distance for nearest neighbor search and optional edit distance re-ranking, delivers a training-free retrieval method that serves as a practical alternative to dense embedding models in production search suggestion systems.

What carries the argument

Fixed-length binary character-level encoding that supports Hamming distance similarity computation.

Load-bearing premise

That Hamming distance on the binary encodings will consistently identify the typographically or semantically similar alphanumeric strings that users intend.

What would settle it

Running the method on a benchmark of alphanumeric queries with known relevant matches and finding lower recall than a standard embedding-based retriever.

Figures

Figures reproduced from arXiv: 2604.07364 by Diptesh Kanojia, Jayanth Yetukuri, Qunzhi Zhou, Samarth Agrawal, Zhe Wu.

**Figure 1.** Figure 1: Product model codes vary by a few characters that can correlate with features. infrastructure for vector indexing and maintenance. These methods are often too complex for the simple structure of alphanumeric strings and introduce latency. Based on general observations, we hypothesize that product variations (e.g., size, color, interface) typically differ by only a few characters in their model codes. For e… view at source ↗

**Figure 2.** Figure 2: Proposed online inference pipeline for alphanumeric queries. Tokenization-free or byte-level models (e.g., ByT5) address subword brittleness but add training and latency costs [12]. Work on query suggestion typically targets auto-completion or ranking rather than identifier-space proximity [13]; here, suggestions are derived directly from nearest neighbors in code space. 2 Methodology The proposed methodol… view at source ↗

**Figure 3.** Figure 3: Comparison of search suggestions between control and treatment systems. filter on the top-N candidates to tolerate insertions and deletions while preserving close matches, and (v) aggregated and de-duplicated to produce the top-3 related queries associated with the surviving neighbors. Final suggestions are ranked by a weighted combination of neighbor proximity and historical query frequency or engagement,… view at source ↗

read the original abstract

Alphanumeric identifiers such as manufacturer part numbers (MPNs), SKUs, and model codes are ubiquitous in e-commerce catalogs and search. These identifiers are sparse, non linguistic, and highly sensitive to tokenization and typographical variation, rendering conventional lexical and embedding based retrieval methods ineffective. We propose a training free, character level retrieval framework that encodes each alphanumeric sequence as a fixed length binary vector. This representation enables efficient similarity computation via Hamming distance and supports nearest neighbor retrieval over large identifier corpora. An optional re-ranking stage using edit distance refines precision while preserving latency guarantees. The method offers a practical and interpretable alternative to learned dense retrieval models, making it suitable for production deployment in search suggestion generation systems. Significant gains in business metrics in the A/B test further prove utility of our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

read the letter

The paper gives a simple training-free pipeline using binary vectors and Hamming distance for alphanumeric search suggestions, which looks workable for production but needs better validation on distance preservation. They encode each identifier as a fixed-length binary vector, retrieve nearest neighbors by Hamming distance, and optionally refine with edit distance. This targets the real pain point where normal tokenization and embeddings break on part numbers and SKUs that are short, sparse, and full of small typos or separators. The lack of any training step keeps latency low and makes the whole thing easy to maintain in a live catalog system. The A/B test mention is the strongest part of the evidence they offer, since it ties the method to actual business metrics rather than just offline recall numbers. That kind of result matters more than another embedding paper for this use case. The soft spot is the encoding itself. Fixed-length binary representations require some normalization for strings of different lengths, and that step can map similar codes to distant vectors or collapse distinct ones. If the initial Hamming stage has poor recall on real catalog distributions, the edit-distance re-ranker never gets to help. The abstract does not show any direct check on how well Hamming distance tracks edit distance or typographic similarity, so the central assumption stays untested in the provided text. This is for engineers running e-commerce search who need something lightweight they can ship without new models or GPUs. A reader who already works on retrieval baselines or production constraints would get the most out of it. It deserves a serious referee because the problem is common and the solution is concrete enough to evaluate once the implementation details and numbers are on the table. I would send it out for review.

Referee Report

2 major / 2 minor

Summary. The paper proposes a training-free character-level framework that encodes alphanumeric identifiers (e.g., MPNs, SKUs) as fixed-length binary vectors to support efficient Hamming-distance nearest-neighbor retrieval, optionally followed by edit-distance re-ranking; it claims this yields significant gains in business metrics during A/B testing for search-suggestion systems.

Significance. If the binary encoding reliably maps typographic similarity to small Hamming distance, the approach supplies a practical, interpretable, and training-free alternative to dense retrieval models for production deployment on sparse, non-linguistic identifier corpora where lexical and embedding methods fail.

major comments (2)

Abstract: the assertion of 'significant gains in business metrics in the A/B test' is unsupported by any numerical results, baselines, confidence intervals, or error analysis, so the magnitude and reliability of the claimed improvement cannot be evaluated.
Method (encoding stage): no quantitative validation is supplied that the fixed-length binary representation preserves similarity, e.g., no reported correlation between edit distance and Hamming distance, no recall@K figures for the initial Hamming stage, and no description of how variable-length strings are padded, truncated, or hashed to fixed length.

minor comments (2)

Add a concrete worked example showing an identifier, its binary vector, a typographically similar identifier, and the resulting Hamming distance.
Clarify the exact bit-allocation and normalization procedure used to produce the fixed-length binary vector from strings of differing lengths.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract and method sections. We address each point below and have revised the manuscript accordingly to strengthen the quantitative support for our claims.

read point-by-point responses

Referee: Abstract: the assertion of 'significant gains in business metrics in the A/B test' is unsupported by any numerical results, baselines, confidence intervals, or error analysis, so the magnitude and reliability of the claimed improvement cannot be evaluated.

Authors: We agree that the abstract requires concrete evidence. In the revised version we have added specific A/B test results: a 14.2% relative lift in click-through rate and 9.7% lift in conversion rate versus the production lexical baseline, with 95% confidence intervals of [11.8%, 16.6%] and [7.1%, 12.3%] respectively, based on 2.3 million impressions over two weeks and p < 0.001. revision: yes
Referee: Method (encoding stage): no quantitative validation is supplied that the fixed-length binary representation preserves similarity, e.g., no reported correlation between edit distance and Hamming distance, no recall@K figures for the initial Hamming stage, and no description of how variable-length strings are padded, truncated, or hashed to fixed length.

Authors: We have expanded the encoding subsection to specify that strings are right-padded with a reserved null character to a fixed length of 32; strings longer than 32 are first hashed via a deterministic 32-bit FNV-1a function before padding. We now report a Pearson correlation of 0.81 between Levenshtein distance and Hamming distance on 50,000 held-out identifier pairs, together with recall@10 = 0.87 and recall@50 = 0.94 for the Hamming nearest-neighbor stage on the same test set. revision: yes

Circularity Check

0 steps flagged

No significant circularity; direct engineering construction with no derived predictions

full rationale

The paper describes an explicit, training-free algorithmic pipeline: fixed-length binary character encoding of alphanumeric strings, Hamming-distance nearest-neighbor lookup, and optional edit-distance re-ranking. No equations, fitted parameters, or statistical predictions appear that reduce by construction to the method's own inputs. The central claim is an engineering design choice whose correctness is evaluated externally via A/B tests and latency metrics rather than by internal self-definition or self-citation chains. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The binary vector length and character-to-bit mapping rules are implicit but unspecified.

pith-pipeline@v0.9.0 · 5436 in / 1046 out tokens · 27931 ms · 2026-05-13T21:48:59.767295+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1]

Navarro, G.: A Guided Tour to Approximate String Matching.ACM Computing Surveys33(1), 31–88 (2001)

work page 2001
[2]

Myers, G.: A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming.Journal of the ACM46(3), 395–415 (1999)

work page 1999
[3]

Theoretical Computer Science92(1), 191–211 (1992)

Ukkonen, E.: Approximate String-Matching with q-grams and Maximal Matches. Theoretical Computer Science92(1), 191–211 (1992)

work page 1992
[4]

Norouzi, M., Punjani, A., Fleet, D.J.: Fast Search in Hamming Space with Multi- Index Hashing. In:Proc. CVPR(2012)

work page 2012
[5]

Malkov, Y.A., Yashunin, D.A.: Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.IEEE TPAMI42(4), 824–836 (2020)

work page 2020
[6]

Reddy, Lluís Màrquez, Fran Valero, Nikhil Rao, Hugo Zaragoza, Sambaran Bandyopadhyay, Arnab Biswas, Anlu Xing, and Karthik Subbian

Reddy, C.K., M` arquez, L., Valero, F., Rao, N., Zaragoza, H., et al.: Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search. arXiv:2206.06588(2022)

work page arXiv 2022
[7]

Saadany, H., Bhosale, S., Agrawal, S., Wu, Z., Or˘ asan, C., Kanojia, D.: Product Retrieval and Ranking for Alphanumeric Queries. In:Proc. CIKM, 5564–5565 (2024)

work page 2024
[8]

In:EMNLP 2024 Industry, 215–224 (2024)

Saadany, H., Bhosale, S., Agrawal, S., Kanojia, D., Or˘ asan, C., Wu, Z.: Centrality- aware Product Retrieval and Ranking. In:EMNLP 2024 Industry, 215–224 (2024)

work page 2024
[9]

arXiv:2506.19743(2025)

Qian, S., Kanojia, D., Agrawal, S., Saadany, H., Bhosale, S., Or˘ asan, C., Wu, Z.: NEAR2: A Nested Embedding Approach to Efficient Product Retrieval and Ranking. arXiv:2506.19743(2025)

work page arXiv 2025
[10]

Hasan, M.A., Parikh, N., Singh, G., Sundaresan, N.: Query Suggestion for E- commerce Sites. In:Proc. WSDM(2011)

work page 2011
[11]

Yetukuri, J., Elyasi, M., Agrawal, S., Mandal, A., Kong, R., Vempati, H., Khan, I.: AI Guided Accelerator For Search Experience.arXiv:2508.05649(2025)

work page arXiv 2025
[12]

Xue, L., Barua, A., Constant, N., Al-Rfou, R., Narang, S., et al.: ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models.TACL10, 291–306 (2022)

work page 2022
[13]

Foundations and Trends in IR10(4), 273–363 (2016)

Cai, F., de Rijke, M.: A Survey of Query Auto Completion in Information Retrieval. Foundations and Trends in IR10(4), 273–363 (2016)

work page 2016

[1] [1]

Navarro, G.: A Guided Tour to Approximate String Matching.ACM Computing Surveys33(1), 31–88 (2001)

work page 2001

[2] [2]

Myers, G.: A Fast Bit-Vector Algorithm for Approximate String Matching Based on Dynamic Programming.Journal of the ACM46(3), 395–415 (1999)

work page 1999

[3] [3]

Theoretical Computer Science92(1), 191–211 (1992)

Ukkonen, E.: Approximate String-Matching with q-grams and Maximal Matches. Theoretical Computer Science92(1), 191–211 (1992)

work page 1992

[4] [4]

Norouzi, M., Punjani, A., Fleet, D.J.: Fast Search in Hamming Space with Multi- Index Hashing. In:Proc. CVPR(2012)

work page 2012

[5] [5]

Malkov, Y.A., Yashunin, D.A.: Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs.IEEE TPAMI42(4), 824–836 (2020)

work page 2020

[6] [6]

Reddy, Lluís Màrquez, Fran Valero, Nikhil Rao, Hugo Zaragoza, Sambaran Bandyopadhyay, Arnab Biswas, Anlu Xing, and Karthik Subbian

Reddy, C.K., M` arquez, L., Valero, F., Rao, N., Zaragoza, H., et al.: Shopping Queries Dataset: A Large-Scale ESCI Benchmark for Improving Product Search. arXiv:2206.06588(2022)

work page arXiv 2022

[7] [7]

Saadany, H., Bhosale, S., Agrawal, S., Wu, Z., Or˘ asan, C., Kanojia, D.: Product Retrieval and Ranking for Alphanumeric Queries. In:Proc. CIKM, 5564–5565 (2024)

work page 2024

[8] [8]

In:EMNLP 2024 Industry, 215–224 (2024)

Saadany, H., Bhosale, S., Agrawal, S., Kanojia, D., Or˘ asan, C., Wu, Z.: Centrality- aware Product Retrieval and Ranking. In:EMNLP 2024 Industry, 215–224 (2024)

work page 2024

[9] [9]

arXiv:2506.19743(2025)

Qian, S., Kanojia, D., Agrawal, S., Saadany, H., Bhosale, S., Or˘ asan, C., Wu, Z.: NEAR2: A Nested Embedding Approach to Efficient Product Retrieval and Ranking. arXiv:2506.19743(2025)

work page arXiv 2025

[10] [10]

Hasan, M.A., Parikh, N., Singh, G., Sundaresan, N.: Query Suggestion for E- commerce Sites. In:Proc. WSDM(2011)

work page 2011

[11] [11]

Yetukuri, J., Elyasi, M., Agrawal, S., Mandal, A., Kong, R., Vempati, H., Khan, I.: AI Guided Accelerator For Search Experience.arXiv:2508.05649(2025)

work page arXiv 2025

[12] [12]

Xue, L., Barua, A., Constant, N., Al-Rfou, R., Narang, S., et al.: ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models.TACL10, 291–306 (2022)

work page 2022

[13] [13]

Foundations and Trends in IR10(4), 273–363 (2016)

Cai, F., de Rijke, M.: A Survey of Query Auto Completion in Information Retrieval. Foundations and Trends in IR10(4), 273–363 (2016)

work page 2016