pith. sign in

arxiv: 2606.19458 · v1 · pith:MY5ZZVISnew · submitted 2026-06-17 · 💻 cs.IR

MonaVec: A Training-Free Embedded Vector Search Kernel for Edge and Offline AI Systems

Pith reviewed 2026-06-26 19:04 UTC · model grok-4.3

classification 💻 cs.IR
keywords vector searchquantizationedge AItraining-freedeterministicoffline retrievalembedded systemsrandomized transform
0
0 comments X

The pith

A randomized Hadamard transform enables training-free 4-bit quantization for deterministic vector search on edge devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MonaVec as a vector search kernel that operates without training data, persistent servers, or large memory, using a single file format. It achieves this by applying a Randomized Hadamard Transform to condition embedding vectors toward a standard normal distribution, allowing precomputed Lloyd-Max quantization tables to produce 4-bit representations. On a test set of 45,000 1024-dimensional embeddings, the 4-bit brute-force version delivers 0.96 recall at 10 in just 27 MB while guaranteeing byte-identical results across runs. This approach targets offline AI applications where existing libraries fall short due to their reliance on training or cloud infrastructure.

Core claim

MonaVec shows that a Randomized Hadamard Transform followed by precomputed 4-bit Lloyd-Max quantization produces an embedded vector index that requires no training pass and persists as a single deterministic file, reaching 0.960 Recall@10 on semantic embeddings while using 27 MB of storage.

What carries the argument

The Randomized Hadamard Transform, which rotates and scales input embeddings to approximate a standard normal distribution for use with fixed quantization tables.

Load-bearing premise

The Randomized Hadamard Transform makes arbitrary embedding distributions sufficiently close to normal for precomputed quantization tables to work without any per-dataset adjustment.

What would settle it

Measure recall on a dataset of embeddings whose distribution resists normalization by the Hadamard transform, such as those with heavy tails or strong correlations, and check if it falls below that of trained quantizers.

Figures

Figures reproduced from arXiv: 2606.19458 by O\u{g}uzhan Yenen.

Figure 1
Figure 1. Figure 1: MonaVec quantization pipeline. Cosine inputs are unit-normalized; L2 inputs optionally standardized via fit(); Dot inputs are raw. All paths share the RHDH rotation and Lloyd-Max quantization stages. A taxonomy of data dependence. Because MonaVec is described as “training-free,” it is worth stating precisely which stages touch the data. The default configuration—BruteForce search over the RHDH + Lloyd-Max … view at source ↗
Figure 2
Figure 2. Figure 2: Centroid placement for uniform quantization versus Lloyd-Max on [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Mixed-precision quantization: Recall@10 and compression ratio for pure 2-bit, mixed [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Why M must scale with N. With low M (left), the graph is sparse and its diameter is large—greedy search from the query entry (⋆) frequently stalls before reaching the true neighbour (×). With high M (right), the dense graph has small diameter and reliable navigation. At N=1.18M, this is the difference between Recall@10 of 0.800 (M=32) and 0.850 (M=64). As [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: SIMD kernel speedup relative to scalar (4-bit dot product, d=1024). AVX2+FMA [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: MonaVec Recall@10 across three workloads. On semantic embeddings (AG News, glove-100)—the primary target—both BruteForce and HNSW exceed 0.85 recall. On raw pixels (fashion-mnist), scalar quantization reaches 0.62, a structural limit discussed in Section 5.1. 4.2 Main Results: Semantic Embeddings (AG News) [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: L2 standardization ablation on fashion-mnist. No preprocessing (raw) gives 0.41 [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Recall@10 vs. QPS tradeoff on glove-100 (1.18M vectors) for M=32 and M=64. The [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Recall@10 vs. QPS against usearch and hnswlib on two cosine datasets (single-core, [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Memory footprint: float32 vs. MonaVec 4-bit. The 8× reduction allows 1M × 768-dim to fit in 384 MB instead of 3.1 GB—within the RAM budget of mobile flagship devices. At scale, the memory advantage is decisive: 1M × 1536-dim vectors occupy 768 MB at 4-bit vs 6.1 GB at float32. On a Raspberry Pi 5 (8 GB RAM), MonaVec can serve a 1M-vector index alongside an on-device LLM; float32 storage makes this impossi… view at source ↗
Figure 11
Figure 11. Figure 11: Recall@10 vs. throughput (log QPS) on AG News 45K [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗
read the original abstract

We present MonaVec, a deterministic, embedded vector-search kernel for edge and offline AI -- settings where server infrastructure, network connectivity, and training data are all unavailable. Existing vector-search systems assume a persistent server, gigabytes of RAM, or a training pass over the corpus; MonaVec instead targets the deployment profile of SQLite: one file, one function call, runs anywhere. Its quantization core is training-free by default and data-oblivious: a Randomized Hadamard Transform (RHDH) conditions any input distribution toward N(0,1), so precomputed Lloyd-Max tables quantize to 4 bits (8x smaller) with no learned codebook and no data pass. The index persists as a single .mvec file whose embedded ChaCha20 rotation seed makes results reproducible across architectures and byte-identical within a build -- a determinism guarantee that parallel-build graph libraries cannot offer. On semantic embeddings (AG News, 45K x 1024-dim BGE-M3, cosine), MonaVec 4-bit BruteForce reaches 0.960 Recall@10 in 27 MB -- leading float32 FAISS-IVF and 8-bit usearch on recall -- while trading peak throughput for byte-identical determinism. A single-pass global standardization (fit()) extends the same data-oblivious pipeline to magnitude-sensitive L2 data, and optional IvfFlat and HNSW backends carry it to million-vector corpora. MonaVec is implemented in pure Rust with Python bindings and runtime SIMD dispatch (AVX-512/AVX2/NEON/scalar). It targets on-device RAG, offline agents, and embedded retrieval -- the niche SQLite occupies for relational data: one file, one call, runs anywhere.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents MonaVec, a deterministic embedded vector-search kernel for edge and offline AI. It uses a Randomized Hadamard Transform (RHDH) to condition input embeddings toward N(0,1), enabling precomputed Lloyd-Max tables for training-free 4-bit quantization. On AG News (45K × 1024-dim BGE-M3, cosine), the 4-bit BruteForce variant reports 0.960 Recall@10 in 27 MB while outperforming float32 FAISS-IVF and 8-bit usearch on recall; optional IVF/HNSW backends and a single-pass standardization extend it to larger or L2 corpora. The index is a single .mvec file with embedded ChaCha20 seed for byte-identical reproducibility.

Significance. If the central claims hold, the work addresses a practical niche in cs.IR for on-device retrieval by delivering a training-free, single-file, architecture-portable kernel with strong determinism guarantees. The emphasis on data-oblivious quantization and embedded reproducibility is a clear strength for offline RAG and embedded agents; the pure-Rust implementation with SIMD dispatch further supports deployability.

major comments (2)
  1. [Abstract] Abstract: the reported 0.960 Recall@10 (and all other numeric claims) is presented with no experimental details, error bars, full baseline configurations, or verification of the distribution-conditioning claim, so the numbers cannot be assessed from the given text.
  2. [Quantization pipeline description] Quantization pipeline description: the claim that a single RHDH produces marginals sufficiently close to N(0,1) for precomputed Lloyd-Max tables to incur negligible extra error is load-bearing for both the recall figure and the training-free guarantee, yet no Kolmogorov-Smirnov distances, tail quantiles, or post-transform distribution statistics are provided on the AG News BGE-M3 corpus.
minor comments (1)
  1. [Abstract] The interaction between the optional single-pass global standardization (fit()) and the data-oblivious claim for magnitude-sensitive L2 data is stated but not formalized.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's constructive feedback on our manuscript. Below we provide point-by-point responses to the major comments. We agree with the need for greater transparency in the abstract and additional empirical support for the core technical claim, and we will incorporate revisions to address these points.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the reported 0.960 Recall@10 (and all other numeric claims) is presented with no experimental details, error bars, full baseline configurations, or verification of the distribution-conditioning claim, so the numbers cannot be assessed from the given text.

    Authors: We acknowledge that the abstract, due to its brevity, does not include the full experimental details. The manuscript's Experiments section provides the complete setup: AG News corpus (45K documents), 1024-dimensional BGE-M3 embeddings, cosine similarity metric, and comparisons against float32 FAISS-IVF and 8-bit usearch. The method's determinism via fixed ChaCha20 seed means there is no run-to-run variance, rendering error bars inapplicable. To improve the abstract, we will add a concise description of the evaluation protocol and note the location of detailed results in the paper. The distribution-conditioning verification is addressed in our response to the second comment. revision: yes

  2. Referee: [Quantization pipeline description] Quantization pipeline description: the claim that a single RHDH produces marginals sufficiently close to N(0,1) for precomputed Lloyd-Max tables to incur negligible extra error is load-bearing for both the recall figure and the training-free guarantee, yet no Kolmogorov-Smirnov distances, tail quantiles, or post-transform distribution statistics are provided on the AG News BGE-M3 corpus.

    Authors: The referee is correct that explicit empirical validation of the post-RHDH marginal distributions would strengthen the presentation. Although the use of RHDH for approximate Gaussianization draws on established theoretical results, we will add to the revised manuscript a dedicated subsection or appendix with the requested statistics. This will include Kolmogorov-Smirnov test distances to N(0,1), comparisons of tail quantiles, and summary statistics for the transformed embeddings from the AG News BGE-M3 corpus. These additions will confirm that the approximation error is negligible for the 4-bit quantization. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents a data-oblivious pipeline consisting of a fixed Randomized Hadamard Transform followed by precomputed Lloyd-Max quantization tables, with results on the AG News corpus reported as empirical outcomes rather than quantities derived by construction from the evaluation set. No equations or steps reduce the claimed recall to a fitted parameter or self-referential definition, and the manuscript contains no self-citations, uniqueness theorems, or ansatzes that would create load-bearing circularity. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the statistical effect of the Randomized Hadamard Transform and the appropriateness of fixed Lloyd-Max tables for the resulting distribution; no free parameters are fitted to target data and no new entities are postulated.

axioms (1)
  • domain assumption Randomized Hadamard Transform conditions any input distribution toward N(0,1)
    Invoked to justify use of precomputed Lloyd-Max tables without a data pass or learned codebook.

pith-pipeline@v0.9.1-grok · 5854 in / 1248 out tokens · 45706 ms · 2026-06-26T19:04:39.821835+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 5 canonical work pages · 4 internal anchors

  1. [1]

    Gerganov et al

    G. Gerganov et al. llama.cpp: Inference of LLaMA model in pure C/C++. https: //github.com/ggerganov/llama.cpp, 2023

  2. [2]

    T. Chen, L. Zheng, Z. Shen, et al. MLC-LLM.https://github.com/mlc-ai/mlc-llm, 2023

  3. [3]

    Lewis, E

    P. Lewis, E. Perez, A. Piktus, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. InAdvances in Neural Information Processing Systems (NeurIPS), 2020

  4. [4]

    Qdrant: Vector Search Engine.https://qdrant.tech, 2021

    Qdrant Team. Qdrant: Vector Search Engine.https://qdrant.tech, 2021

  5. [5]

    Weaviate: Open source vector database.https://weaviate.io, 2021

    Weaviate Team. Weaviate: Open source vector database.https://weaviate.io, 2021

  6. [6]

    Johnson, M

    J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with GPUs.IEEE Transactions on Big Data, 7(3):535–547, 2021

  7. [7]

    R. Hipp. SQLite.https://www.sqlite.org, 2000

  8. [8]

    S. P. Lloyd. Least squares quantization in PCM.IEEE Transactions on Information Theory, 28(2):129–137, 1982

  9. [9]

    J. Max. Quantizing for minimum distortion.IRE Transactions on Information Theory, 6(1):7–12, 1960

  10. [10]

    Ailon and B

    N. Ailon and B. Chazelle. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors.SIAM Journal on Computing, 39(1):302–322, 2009

  11. [11]

    S. S. Vempala.The Random Projection Method, volume 65 ofDIMACS Series in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society, 2004

  12. [12]

    IEEE Standard for Floating-Point Arithmetic.IEEE Std 754-2019, 2019

    IEEE. IEEE Standard for Floating-Point Arithmetic.IEEE Std 754-2019, 2019

  13. [13]

    Goldberg

    D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, 23(1):5–48, 1991

  14. [14]

    D. J. Bernstein. ChaCha, a variant of Salsa20. InWorkshop Record of SASC, 2008

  15. [15]

    Ashkboos, A

    S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman. QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs. InAdvances in Neural Information Processing Systems (NeurIPS), 2024. arXiv:2404.00456

  16. [16]

    Z. Liu, C. Zhao, I. Fedorov, B. Soran, D. Choudhary, R. Krishnamoorthi, V. Chandra, Y. Tian, and T. Blankevoort. SpinQuant: LLM quantization with learned rotations. In International Conference on Learning Representations (ICLR), 2025. arXiv:2405.16406. 26 MonaVec — Embedded Vector Search for Edge AI 2026

  17. [17]

    TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

    A. Zandieh, M. Daliri, M. Hadian, and V. Mirrokni. TurboQuant: Online vector quantization with near-optimal distortion rate. InInternational Conference on Learning Representations (ICLR), 2026. arXiv:2504.19874

  18. [18]

    T. M. Cover and J. A. Thomas.Elements of Information Theory, 2nd ed. Wiley, 2006

  19. [19]

    Gao and C

    J. Gao and C. Long. RaBitQ: Quantizing High-Dimensional Vectors with a Theoretical Error Bound for Approximate Nearest Neighbor Search.Proceedings of the ACM on Management of Data (SIGMOD), 2(3):Article 167, 2024

  20. [20]

    Y. A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.IEEE TPAMI, 42(4):824–836, 2020

  21. [21]

    G. V. Cormack, C. L. Clarke, and S. Buettcher. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. InProceedings of SIGIR, 2009

  22. [22]

    Formal, B

    T. Formal, B. Piwowarski, and S. Clinchant. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. InProceedings of SIGIR, 2021

  23. [23]

    J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, and Z. Liu. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distil- lation. InFindings of the Association for Computational Linguistics: ACL 2024, pages 2318–2335, 2024. arXiv:2402.03216

  24. [24]

    Vardanian (Unum)

    A. Vardanian (Unum). USearch: Smaller & faster single-file vector search engine.https: //github.com/unum-cloud/usearch, 2023

  25. [25]

    Y. A. Malkov and D. A. Yashunin. hnswlib – fast approximate nearest neighbor search. https://github.com/nmslib/hnswlib, 2018

  26. [26]

    A. Garcia. sqlite-vec: A vector search SQLite extension that runs anywhere. https: //github.com/asg017/sqlite-vec, 2024

  27. [27]

    Jégou, M

    H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE TPAMI, 33(1):117–128, 2011

  28. [28]

    T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization.IEEE TPAMI, 36(4):744– 755, 2014

  29. [29]

    R. Guo, P. Sun, E. Lindgren, Q. Geng, D. Simcha, F. Chern, and S. Kumar. Accelerating large-scale inference with anisotropic vector quantization. InProceedings of ICML, 2020

  30. [30]

    Khattab and M

    O. Khattab and M. Zaharia. ColBERT: Efficient and Effective Passage Search via Contex- tualized Late Interaction over BERT. InProceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 39–48, 2020

  31. [31]

    ColPali: Efficient Document Retrieval with Vision Language Models

    M. Faysse, H. Sibille, T. Wu, B. Omrani, G. Viaud, C. Hudelot, and P. Colombo. ColPali: Efficient Document Retrieval with Vision Language Models.arXiv preprint arXiv:2407.01449, 2024. 27