Multi-Probe Zero Collision Hash (MPZCH): Mitigating Embedding Collisions and Enhancing Model Freshness in Large-Scale Recommenders

Bin Kuang; Bin Wen; Bi Xue; Chao Deng; Dennis van der Staay; Eddy Li; Emma Lin; Henry Wei; Kai Ren; Kaustubh Vartak

arxiv: 2602.17050 · v3 · pith:ZA4VTAOQnew · submitted 2026-02-19 · 💻 cs.LG

Multi-Probe Zero Collision Hash (MPZCH): Mitigating Embedding Collisions and Enhancing Model Freshness in Large-Scale Recommenders

Ziliang Zhao , Bi Xue , Emma Lin , Tianqi Lu , Mengjiao Zhou , Kaustubh Vartak , Shakhzod Ali-Zade , Tao Li

show 11 more authors

Bin Kuang Rui Jian Bin Wen Dennis van der Staay Yixin Bao Eddy Li Chao Deng Henry Wei Songbin Liu Qifan Wang Kai Ren

This is my paper

Pith reviewed 2026-05-21 11:56 UTC · model grok-4.3

classification 💻 cs.LG

keywords multi-probe hashembedding collisionsrecommendation systemsmodel freshnesslinear probingeviction policyCUDA implementationTorchRec

0 comments

The pith

A multi-probe hash technique removes embedding collisions and stale data in large recommender models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Embedding tables in recommendation systems map many unique IDs to vectors but suffer from collisions when IDs share the same table slots. The paper shows that MPZCH uses linear probing across multiple locations and active eviction to avoid these overlaps and clear out old data. A sympathetic reader would care because collisions reduce the quality of personalized recommendations and prevent new features from learning properly. The method keeps the system fast enough for production by using optimized code on GPUs. Online tests confirm it achieves no collisions for users and better item embeddings.

Core claim

MPZCH is a novel indexing mechanism based on linear probing that effectively mitigates embedding collisions. With reasonable table sizing, it often eliminates these collisions entirely while maintaining production-scale efficiency. MPZCH utilizes auxiliary tensors and high-performance CUDA kernels to implement configurable probing and active eviction policies. By retiring obsolete IDs and resetting reassigned slots, MPZCH prevents the stale embedding inheritance typical of hash-based methods, ensuring new features learn effectively from scratch. Despite its collision-mitigation overhead, the system maintains training QPS and inference latency comparable to existing methods.

What carries the argument

Multi-probe linear probing combined with active eviction policies using auxiliary tensors and CUDA kernels to manage embedding table slots.

If this is right

Collisions are eliminated with reasonable table sizes.
Obsolete IDs are retired to prevent stale inheritance in reassigned slots.
New features start learning from scratch without interference.
Training and inference speeds match current production levels.
Item embedding quality improves in live experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique might extend to other systems with high-cardinality inputs like search or advertising models.
Adjusting the number of probes dynamically could balance collision avoidance against computation cost in varying workloads.
The open release allows testing whether the zero-collision property holds across different recommendation architectures.
Integration with existing feature stores could require changes to ID management to fully benefit from the freshness gains.

Load-bearing premise

The auxiliary structures and eviction logic add so little overhead that real production training speed and serving speed stay unchanged.

What would settle it

A direct comparison of training queries per second before and after adopting MPZCH in a live large-scale recommender system.

Figures

Figures reproduced from arXiv: 2602.17050 by Bin Kuang, Bin Wen, Bi Xue, Chao Deng, Dennis van der Staay, Eddy Li, Emma Lin, Henry Wei, Kai Ren, Kaustubh Vartak, Mengjiao Zhou, Qifan Wang, Rui Jian, Shakhzod Ali-Zade, Songbin Liu, Tao Li, Tianqi Lu, Yixin Bao, Ziliang Zhao.

**Figure 2.** Figure 2: An example of kernel execution with TTL eviction policy. Probing details are omitted for clarity, with arrows pointing [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: An example of kernel execution with LRU eviction policy. Probing details are omitted for clarity, with arrows pointing [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: The MPZCH module is co-sharded with the associated embedding table, ensuring that operations are executed locally [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Top: Video embeddings from selected creators in the baseline model appear dispersed and uncorrelated. Bottom: With [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Top: In the absence of MPZCH, video embeddings exhibit volatile shifts attributed to hash collisions. Bottom: MPZCH [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

read the original abstract

Embedding tables are critical components of large-scale recommendation systems, facilitating the efficient mapping of high-cardinality categorical features into dense vector representations. However, as the volume of unique IDs expands, traditional hash-based indexing methods suffer from collisions that degrade model performance and personalization quality. We present Multi-Probe Zero Collision Hash (MPZCH), a novel indexing mechanism based on linear probing that effectively mitigates embedding collisions. With reasonable table sizing, it often eliminates these collisions entirely while maintaining production-scale efficiency. MPZCH utilizes auxiliary tensors and high-performance CUDA kernels to implement configurable probing and active eviction policies. By retiring obsolete IDs and resetting reassigned slots, MPZCH prevents the stale embedding inheritance typical of hash-based methods, ensuring new features learn effectively from scratch. Despite its collision-mitigation overhead, the system maintains training QPS and inference latency comparable to existing methods. Rigorous online experiments demonstrate that MPZCH achieves zero collisions for user embeddings and significantly improves item embedding freshness and quality. The solution has been released within the open-source TorchRec library for the broader community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MPZCH pairs multi-probe linear hashing with active eviction to target collisions and stale embeddings, but the efficiency claims need timing data to hold up at scale.

read the letter

MPZCH pairs multi-probe linear hashing with active eviction to target collisions and stale embeddings, but the efficiency claims need timing data to hold up at scale. The core idea is straightforward: use auxiliary tensors and configurable probing to find free slots, then evict old IDs so reassigned slots start clean instead of carrying over previous vectors. This directly tackles two production headaches in large embedding tables—hash collisions that hurt personalization and stale embeddings that slow learning on new features. The paper shows how this fits into TorchRec with CUDA kernels for the probing and eviction steps, and it reports online results with zero collisions on user embeddings plus better item freshness. Releasing the code is a practical move that lets others plug it in and measure for themselves. The combination of collision avoidance and explicit freshness reset is not something I recall seeing packaged exactly this way in the references they cite, so that part feels like the actual increment. The main soft spot is the overhead story. The description asserts that training QPS and inference latency stay comparable to baseline hashing, yet there are no head-to-head tables, per-kernel timings, or cache-miss counts at the cardinalities and batch sizes that matter (hundreds of millions of IDs). If the extra atomic operations and memory traffic from probing and eviction add up, the “production-scale efficiency” part could slip even if collisions reach zero. The online experiments are cited as rigorous, but without the quantitative details on how collisions were counted or what the freshness metric actually moved, it is difficult to judge the size of the win. This paper is aimed at practitioners who run large-scale recommenders and already fight hash collisions in their embedding tables. Someone implementing in TorchRec or a similar framework could pull useful implementation patterns from it. It is worth sending to peer review because it describes a concrete mechanism for a real deployed limitation and ships code; referees can push for the missing benchmark numbers and clarify the measurement setup. I would not cite it in my own work unless the overhead data improves, but the approach itself is worth testing on a production workload.

Referee Report

1 major / 1 minor

Summary. The paper introduces Multi-Probe Zero Collision Hash (MPZCH), an indexing scheme for embedding tables in large-scale recommendation models. It employs multi-probe linear probing, auxiliary tensors, and CUDA kernels implementing configurable probing and active eviction to eliminate collisions under reasonable table sizing, retire obsolete IDs to avoid stale embedding inheritance, and maintain training QPS and inference latency comparable to standard hash-based methods. The work claims zero collisions for user embeddings and improved item freshness in rigorous online experiments, with the implementation released in TorchRec.

Significance. If the efficiency and collision-free claims hold under production workloads, the approach would address a practical pain point in recommender systems by improving embedding quality and freshness without throughput penalties. The open-source release in TorchRec is a positive contribution that enables reproducibility and adoption. However, the current manuscript provides no quantitative metrics, timing data, or baseline comparisons to substantiate the central performance assertions.

major comments (1)

[Section 4] Section 4 and CUDA kernel description: The claim that auxiliary tensors, multi-probe linear probing, and active eviction maintain training QPS and inference latency comparable to standard TorchRec hashing is load-bearing for the 'production-scale efficiency' premise, yet the manuscript supplies no per-operation timing, cache-miss rates, atomic-operation overhead measurements, or head-to-head QPS/latency tables versus baseline for cardinalities of 10^8–10^9 IDs and realistic batch sizes. Without these data the efficiency assertion cannot be evaluated.

minor comments (1)

[Abstract] The abstract asserts 'rigorous online experiments demonstrate zero collisions' but provides no quantitative metrics, baseline comparisons, statistical details, or description of collision measurement methodology; these details should be added to the experimental section for verifiability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback on the manuscript. We agree that quantitative efficiency metrics are necessary to support the production-scale claims and have planned revisions to address this directly.

read point-by-point responses

Referee: [Section 4] Section 4 and CUDA kernel description: The claim that auxiliary tensors, multi-probe linear probing, and active eviction maintain training QPS and inference latency comparable to standard TorchRec hashing is load-bearing for the 'production-scale efficiency' premise, yet the manuscript supplies no per-operation timing, cache-miss rates, atomic-operation overhead measurements, or head-to-head QPS/latency tables versus baseline for cardinalities of 10^8–10^9 IDs and realistic batch sizes. Without these data the efficiency assertion cannot be evaluated.

Authors: We acknowledge that the current manuscript does not include the requested quantitative benchmarks, which limits the ability to fully evaluate the efficiency claims. In the revised version we will add a new subsection to Section 4 containing head-to-head QPS and latency tables for MPZCH versus standard TorchRec hashing. These will report training and inference measurements on tables with cardinalities of 10^8 and 5×10^8 IDs using batch sizes of 1024–4096, along with per-operation timings, cache-miss rates, and atomic-operation overhead obtained via CUDA profiling. We will also note the practical limits of running full 10^9-ID experiments and provide the most representative data available from our test environment. revision: yes

Circularity Check

0 steps flagged

No circularity: engineering implementation with experimental validation

full rationale

The paper presents MPZCH as a systems-level indexing mechanism using multi-probe linear probing, auxiliary tensors, configurable eviction, and CUDA kernels to reduce embedding collisions while claiming maintained QPS/latency. No mathematical derivation chain, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Collision elimination is asserted via table sizing and probing policy rather than by construction from prior fitted results; performance claims rest on described online experiments and implementation details, not on quantities that reduce to the paper's own inputs. This is a standard non-circular engineering description.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions about linear probing behavior and practical engineering choices for table sizing and eviction thresholds rather than new theoretical entities or fitted constants.

free parameters (2)

table sizing factor
Chosen as 'reasonable' to achieve zero collisions; exact value or selection procedure not specified in abstract.
probing depth
Configurable number of probes; treated as a tunable parameter for the method.

axioms (1)

domain assumption Linear probing with sufficient table size and auxiliary structures can locate empty slots without excessive overhead
Invoked to support the zero-collision claim under production constraints.

pith-pipeline@v0.9.0 · 5797 in / 1183 out tokens · 45117 ms · 2026-05-21T11:56:06.830153+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Table 1: Collision rate comparison... MPZCH achieves zero collisions with adequate capacity and probe depth

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 1 internal anchor

[1]

Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. InProceedings of the 10th ACM Conference on Recommender Systems (RecSys)

work page 2016
[2]

Aditya Desai et al. 2022. Random Offset Block Embedding (ROBE) for compressed embedding tables in deep learning recommendation systems. InProceedings of Machine Learning and Systems (MLSys)

work page 2022
[3]

Benjamin Ghaemmaghami et al. 2022. Learning to Collide: Recommendation System Model Compression with Learned Hash Functions. InarXiv preprint arXiv:2203.14996

work page arXiv 2022
[4]

Antonio Ginart et al. 2021. Mixed Dimension Embeddings with Application to Memory-Efficient Recommendation Systems. InIEEE International Symposium on Information Theory (ISIT)

work page 2021
[5]

Huan et al

Y. Huan et al. 2023. Multiplexed Embeddings for Large-Scale Recommendation. arXiv preprint arXiv:2310.04400(2023)

work page arXiv 2023
[6]

Wang-Cheng Kang et al. 2020. Learning to Embed Categorical Features without Embedding Tables for Recommendation. InarXiv preprint arXiv:2010.10784

work page arXiv 2020
[7]

Kulkarni et al

R. Kulkarni et al . 2024. Semantic IDs for Recommendation.arXiv preprint arXiv:2405.03988(2024)

work page arXiv 2024
[8]

Zhuoran Liu et al. 2022. Monolith: Real Time Recommendation System With Collisionless Embedding Table.arXiv preprint arXiv:2209.07663(2022)

work page arXiv 2022
[9]

Qi Pi et al. 2020. Practice on Long Sequential User Behavior Modeling for Click- Through Rate Prediction. InProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD)

work page 2020
[10]

Hao-Jun Shi et al . 2020. Compositional Embeddings Using Complementary Partitions for Memory-Efficient Recommendation Systems. InProceedings of 7 Table 3: Normalized Entropy (NE) improvements for critical tasks following the deployment of MPZCH for the user embedding table. Prediction Task NE Improvement Share 0.38% Video View Duration (VVD) 0.12% Video V...

work page 2020
[11]

Kilian Weinberger et al. 2009. Feature Hashing for Large Scale Multitask Learning. InProceedings of the 26th Annual International Conference on Machine Learning 8 Table 4: Comparison of intra-creator embedding similarity. MPZCH consistently yields higher similarity scores across all creation time windows, with the most significant gains observed for video...

work page 2009
[12]

Yao Yang et al. 2020. On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models. InarXiv preprint arXiv:2007.14523

work page arXiv 2020
[13]

Chunxing Yin et al. 2021. TT-Rec: Tensor Train Compression for Deep Learn- ing Recommendation Models. InProceedings of Machine Learning and Systems (MLSys)

work page 2021
[14]

Jiaqi Zhai, Lucy Liao, Xing Liu, Yiyuan Liu, Bokang Cao, Patrick Li, Zhaojie Sun, Boxin Zhang, and Yu Shi. 2024. Actions Speak Louder than Words: Trillion- Parameter Sequential Transducers for Generative Recommendations. InForty- first International Conference on Machine Learning (ICML). https://arxiv.org/abs/ 2402.17152

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Zhang et al

B. Zhang et al. 2023. Unified Embedding: Efficiently Learning Embedding Tables for Recommendation Models.arXiv preprint arXiv:2305.12102(2023). 9

work page arXiv 2023
[16]

Shuai Zhang et al. 2019. Deep Learning Based Recommender System: A Survey and New Perspectives.ACM Computing Surveys (CSUR)52, 1 (2019)

work page 2019
[17]

Zhang et al

Y. Zhang et al . 2020. Model Size Reduction Using Frequency Based Double Hashing for Recommender Systems. InProceedings of the 14th ACM Conference on Recommender Systems (RecSys). 10

work page 2020

[1] [1]

Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. InProceedings of the 10th ACM Conference on Recommender Systems (RecSys)

work page 2016

[2] [2]

Aditya Desai et al. 2022. Random Offset Block Embedding (ROBE) for compressed embedding tables in deep learning recommendation systems. InProceedings of Machine Learning and Systems (MLSys)

work page 2022

[3] [3]

Benjamin Ghaemmaghami et al. 2022. Learning to Collide: Recommendation System Model Compression with Learned Hash Functions. InarXiv preprint arXiv:2203.14996

work page arXiv 2022

[4] [4]

Antonio Ginart et al. 2021. Mixed Dimension Embeddings with Application to Memory-Efficient Recommendation Systems. InIEEE International Symposium on Information Theory (ISIT)

work page 2021

[5] [5]

Huan et al

Y. Huan et al. 2023. Multiplexed Embeddings for Large-Scale Recommendation. arXiv preprint arXiv:2310.04400(2023)

work page arXiv 2023

[6] [6]

Wang-Cheng Kang et al. 2020. Learning to Embed Categorical Features without Embedding Tables for Recommendation. InarXiv preprint arXiv:2010.10784

work page arXiv 2020

[7] [7]

Kulkarni et al

R. Kulkarni et al . 2024. Semantic IDs for Recommendation.arXiv preprint arXiv:2405.03988(2024)

work page arXiv 2024

[8] [8]

Zhuoran Liu et al. 2022. Monolith: Real Time Recommendation System With Collisionless Embedding Table.arXiv preprint arXiv:2209.07663(2022)

work page arXiv 2022

[9] [9]

Qi Pi et al. 2020. Practice on Long Sequential User Behavior Modeling for Click- Through Rate Prediction. InProceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD)

work page 2020

[10] [10]

Hao-Jun Shi et al . 2020. Compositional Embeddings Using Complementary Partitions for Memory-Efficient Recommendation Systems. InProceedings of 7 Table 3: Normalized Entropy (NE) improvements for critical tasks following the deployment of MPZCH for the user embedding table. Prediction Task NE Improvement Share 0.38% Video View Duration (VVD) 0.12% Video V...

work page 2020

[11] [11]

Kilian Weinberger et al. 2009. Feature Hashing for Large Scale Multitask Learning. InProceedings of the 26th Annual International Conference on Machine Learning 8 Table 4: Comparison of intra-creator embedding similarity. MPZCH consistently yields higher similarity scores across all creation time windows, with the most significant gains observed for video...

work page 2009

[12] [12]

Yao Yang et al. 2020. On the Factory Floor: ML Engineering for Industrial-Scale Ads Recommendation Models. InarXiv preprint arXiv:2007.14523

work page arXiv 2020

[13] [13]

Chunxing Yin et al. 2021. TT-Rec: Tensor Train Compression for Deep Learn- ing Recommendation Models. InProceedings of Machine Learning and Systems (MLSys)

work page 2021

[14] [14]

Jiaqi Zhai, Lucy Liao, Xing Liu, Yiyuan Liu, Bokang Cao, Patrick Li, Zhaojie Sun, Boxin Zhang, and Yu Shi. 2024. Actions Speak Louder than Words: Trillion- Parameter Sequential Transducers for Generative Recommendations. InForty- first International Conference on Machine Learning (ICML). https://arxiv.org/abs/ 2402.17152

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

Zhang et al

B. Zhang et al. 2023. Unified Embedding: Efficiently Learning Embedding Tables for Recommendation Models.arXiv preprint arXiv:2305.12102(2023). 9

work page arXiv 2023

[16] [16]

Shuai Zhang et al. 2019. Deep Learning Based Recommender System: A Survey and New Perspectives.ACM Computing Surveys (CSUR)52, 1 (2019)

work page 2019

[17] [17]

Zhang et al

Y. Zhang et al . 2020. Model Size Reduction Using Frequency Based Double Hashing for Recommender Systems. InProceedings of the 14th ACM Conference on Recommender Systems (RecSys). 10

work page 2020