Clark Hash: Stateless Sparse Johnson-Lindenstrauss Quantization for Neural Embeddings

Clark Labs Inc; Stanislav Kirdey

arxiv: 2605.28034 · v1 · pith:O2UTVV5Ynew · submitted 2026-05-27 · 💻 cs.AI

Pith reviewed 2026-06-29 13:00 UTC · model grok-4.3

classification 💻 cs.AI

keywords neural embeddings · Johnson-Lindenstrauss projection · scalar quantization · cosine similarity · sentence similarity · stateless codec · compact storage · multilingual embeddings

The pith

A stateless sparse Johnson-Lindenstrauss projection followed by clipping and scalar quantization stores 384-dimensional neural embeddings in 48 bytes while retaining cosine similarity information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Clark Hash as a fixed pipeline that normalizes each embedding, multiplies it by a deterministic sparse signed projection matrix, clips the resulting values, and packs them into a short scalar-quantized code. In the default setting this code occupies 48 bytes instead of the 1536 bytes required for 384-dimensional float32 vectors, a 32-fold reduction achieved without training, learned codebooks, or any corpus statistics. Queries remain in floating point and are scored directly against the stored codes using the same cosine measure. On two standard multilingual sentence-similarity collections the 48-byte codes produced Pearson correlations of 0.910 and 0.946 with the original dense scores. The method is presented strictly as a storage codec, not as a new theoretical guarantee or as a substitute for approximate nearest-neighbor indexes.
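The four steps in this paragraph can be sketched in a few lines of Python. This is a minimal illustration, not the paper's Rust implementation: the sketch dimension of 96 with 4-bit codes (96 × 4 bits = 48 bytes), the 4 non-zeros per input coordinate, the clipping threshold `CLIP`, and the seeded generator used to fix the matrix are all assumptions chosen so that the output lands at 48 bytes.

```python
import math
import random

DIM = 384          # input embedding dimension (from the text)
SKETCH_DIM = 96    # assumed: 96 coordinates x 4 bits = 48 bytes
NNZ_PER_COL = 4    # assumed sparsity: non-zeros per input coordinate
CLIP = 0.3         # hypothetical clipping threshold
LEVELS = 16        # 4-bit scalar quantizer

def build_projection(seed=0):
    """Fix the sparse signed matrix once: per input coordinate, a list of (row, sign)."""
    rng = random.Random(seed)
    cols = []
    for _ in range(DIM):
        rows = rng.sample(range(SKETCH_DIM), NNZ_PER_COL)
        cols.append([(r, rng.choice((-1.0, 1.0))) for r in rows])
    return cols

def project(vec, cols):
    """L2-normalize the vector, then apply the sparse signed projection."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    scale = 1.0 / (norm * math.sqrt(NNZ_PER_COL))
    out = [0.0] * SKETCH_DIM
    for x, entries in zip(vec, cols):
        for row, sign in entries:
            out[row] += sign * x * scale
    return out

def encode(vec, cols):
    """Clip each projected coordinate, quantize to 4 bits, pack two codes per byte."""
    codes = []
    for y in project(vec, cols):
        y = max(-CLIP, min(CLIP, y))
        q = round((y + CLIP) / (2 * CLIP) * (LEVELS - 1))
        codes.append(min(LEVELS - 1, max(0, q)))
    return bytes((codes[i] << 4) | codes[i + 1] for i in range(0, SKETCH_DIM, 2))
```

Because the matrix depends only on the seed, two processes that call `build_projection(0)` produce identical codes for the same input, which is the property the word "stateless" refers to here.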

Core claim

Clark Hash applies a deterministic sparse signed Johnson-Lindenstrauss projection to normalized embedding vectors, clips the projected coordinates, and stores the result as a fixed-width scalar-quantized integer code. The resulting 48-byte sketches are scored against full-precision query vectors by the same cosine function used on the original dense vectors. On the STS17 and STS22 collections the sketches achieve macro Pearson correlations of 0.910 and 0.946 with the dense baseline when the underlying encoder is a multilingual MiniLM model.
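The asymmetric scoring step, a floating-point query against an integer code, might look like the sketch below. It assumes that the query has already been projected into the sketch space, that codes are 4-bit values packed high nibble first, and that `CLIP` matches whatever the encoder used; none of these details are confirmed by the text.

```python
import math

CLIP = 0.3     # hypothetical clipping threshold, must match the encoder
LEVELS = 16    # assumed 4-bit codes

def unpack(code):
    """Split each byte into two 4-bit integers, high nibble first."""
    out = []
    for b in code:
        out.append(b >> 4)
        out.append(b & 0x0F)
    return out

def dequantize(q):
    """Map an integer level back to its reconstruction value in [-CLIP, CLIP]."""
    return q / (LEVELS - 1) * 2 * CLIP - CLIP

def score(query_sketch, code):
    """Cosine between a float query sketch and a stored integer code."""
    stored = [dequantize(q) for q in unpack(code)]
    dot = sum(a * b for a, b in zip(query_sketch, stored))
    nq = math.sqrt(sum(a * a for a in query_sketch))
    ns = math.sqrt(sum(b * b for b in stored))
    return dot / (nq * ns) if nq and ns else 0.0
```

Only the stored side is quantized, so the query contributes no quantization error of its own to the score.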

What carries the argument

Deterministic sparse signed Johnson-Lindenstrauss projection followed by clipping and scalar quantization, which maps each normalized vector to a short fixed-length integer code that approximately preserves inner-product information.

If this is right

Embedding collections can be stored using 32 times less memory than dense float32 vectors.
New vectors can be encoded and inserted without retraining the codec or recomputing any corpus statistics.
Query vectors stay in full floating-point precision and are compared directly to the stored codes.
The codec can be applied at any embedding dimensionality, given a fixed projection matrix for that dimension, and requires no learned parameters.
It functions as a lightweight alternative to methods that rely on data-dependent codebooks or rotations.
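The storage figures behind the first point are simple arithmetic, reproduced here as a short check. The corpus size of one million vectors is a hypothetical example, not a number from the paper.

```python
DIM = 384            # embedding dimension (from the text)
BYTES_PER_F32 = 4    # size of one float32
CODE_BYTES = 48      # size of one stored code (from the text)

dense_bytes = DIM * BYTES_PER_F32        # bytes per dense vector
ratio = dense_bytes // CODE_BYTES        # compression factor
bits_per_dim = CODE_BYTES * 8 / DIM      # stored bits per original dimension

def corpus_bytes(n_vectors, bytes_per_vector):
    """Total storage for a collection, ignoring index and metadata overhead."""
    return n_vectors * bytes_per_vector

# Hypothetical corpus of one million vectors.
dense_total = corpus_bytes(1_000_000, dense_bytes)
sketch_total = corpus_bytes(1_000_000, CODE_BYTES)
```

At the default operating point the code spends one bit per original dimension, so a million vectors shrink from roughly 1.5 GB to 48 MB.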

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fixed projection could be applied to embeddings from domains other than sentences, such as image or graph representations, to test whether the cosine preservation holds without retraining.
Because the projection matrix is deterministic and stateless, multiple independent databases could share the identical encoding scheme and be merged without re-encoding.
Pairing the sketches with an existing approximate nearest-neighbor index would allow memory-efficient search at the cost of an extra decompression or scoring step against the quantized codes.
The method opens a route to theoretical bounds on the worst-case distortion of cosine similarity under this specific clipping-plus-quantization pipeline.

Load-bearing premise

The fixed sparse signed projection, after clipping and quantization, preserves enough of the original cosine information for the sentence-similarity tasks without any corpus-dependent calibration.

What would settle it

Running the identical 48-byte codec on a fresh sentence-similarity collection and obtaining a macro Pearson correlation noticeably below 0.91 with the dense cosine scores would falsify the preservation claim for the reported operating point.
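A check of this kind needs only the two score lists per subset. The sketch below assumes that "macro" means the unweighted mean of per-subset Pearson correlations between dense cosine scores and sketch scores; the paper's exact aggregation is not spelled out in this summary, and the function names are illustrative.

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def macro_pearson(subsets):
    """subsets maps a subset name to (dense_scores, sketch_scores).

    Returns the unweighted mean of the per-subset correlations together
    with the per-subset values, so that small subsets count as much as
    large ones.
    """
    per_subset = {name: pearson(d, s) for name, (d, s) in subsets.items()}
    return sum(per_subset.values()) / len(per_subset), per_subset
```

Returning the per-subset values alongside the mean makes it easy to see whether a drop in the headline number comes from one language pair or from all of them.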

Original abstract

Clark Hash is a small method for storing neural embeddings in less space. It normalizes each database vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-quantized code. Queries stay in floating point and are scored against the stored sketches. In the default 384-dimensional sentence-embedding setting, Clark Hash stores a cosine-search vector in 48 bytes instead of 1536 bytes for dense f32 storage. This is 32x smaller. The method does not need a training pass, learned codebooks, rotations, or corpus statistics before new vectors can be stored. We describe the codec, the Rust implementation, and a multilingual sentence-similarity evaluation on 9,304 labeled pairs from 29 subsets. With a multilingual MiniLM encoder, the 48-byte sketches reached 0.910 and 0.946 macro Pearson correlation with dense cosine scores on STS17 and STS22. Clark Hash is not a new Johnson-Lindenstrauss theorem and it is not a replacement for approximate nearest-neighbor indexes. It is a simple stateless codec for compact embedding storage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Clark Hash is a straightforward deterministic codec that shrinks 384-dim embeddings to 48 bytes and reports decent Pearson correlation on STS tasks, but it applies known JL tools without new theory or broad testing.

The main takeaway is a fixed, stateless recipe—normalize, deterministic sparse signed JL projection, clip, then scalar quantize—that turns a 384-dim vector into 48 bytes while keeping cosine similarity usable on sentence similarity benchmarks. Queries stay in float and compare directly to the sketches. No training or corpus stats required.

The paper earns credit for spelling out the exact steps and giving concrete numbers: 0.910 and 0.946 macro Pearson on the STS17 and STS22 subsets with a multilingual MiniLM encoder, across 9,304 pairs. The construction avoids any fitted parameters or self-referential quantities, so the central claim rests on the empirical result rather than circular logic.

The evaluation stays narrow. We see headline correlations but no error bars, no ablation on projection sparsity or clipping values, and no tests on actual retrieval or other embedding workloads. The abstract positions the work explicitly as a practical codec, not a new JL bound or ANN index, which matches the evidence shown.

This is aimed at practitioners who need compact storage for fixed embeddings and can tolerate the similarity trade-off. A reader who wants a reproducible, parameter-free compression option will get something usable from it. The work deserves a serious referee because the method is described clearly enough to implement and the modest claim is checkable against the reported data.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Clark Hash, a stateless codec for compressing neural embeddings. It normalizes each vector, applies a deterministic sparse signed Johnson-Lindenstrauss projection, clips the result, and stores a fixed-width scalar-quantized code (48 bytes for 384-dimensional embeddings). Queries remain in floating point and are scored against the sketches. The central empirical claim is that this procedure achieves macro Pearson correlations of 0.910 on STS17 and 0.946 on STS22 with dense cosine similarities, evaluated on 9,304 labeled pairs from 29 subsets using a multilingual MiniLM encoder, while requiring no training, learned parameters, or corpus statistics.

Significance. If the empirical results hold, the work demonstrates a practical, fully deterministic and stateless method for 32x reduction in embedding storage while retaining sufficient similarity signal for sentence-similarity tasks. Positive aspects include the explicit sequence of operations with no fitted parameters and the provision of a Rust implementation, which supports reproducibility.

major comments (2)

[Evaluation] Evaluation section: the reported macro Pearson correlations of 0.910 and 0.946 are given as single point estimates with no error bars, standard deviations, per-subset breakdowns, or variance estimates across the 29 subsets, and without a full description of the experimental protocol or ablation studies on individual codec components; this renders the support for the central claim that the 48-byte sketches preserve enough cosine information only moderate.
[Method] Method section: the precise construction of the deterministic sparse signed Johnson-Lindenstrauss projection (including the exact sparsity level, how the signing is made deterministic, and the target sketch dimension) is described at a high level but lacks the concrete parameter values or pseudocode needed to reproduce the exact numerical results from the textual description alone.
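The variability statistics requested in the first comment could be produced from per-subset correlations alone. The following is one possible sketch, using a percentile bootstrap over subsets for the macro average; the subset names and values in the usage note are made up for illustration and are not results from the paper.

```python
import random
import statistics

def summarize(per_subset, n_boot=2000, alpha=0.05, seed=0):
    """Macro mean, spread, and a bootstrap interval over per-subset correlations."""
    values = list(per_subset.values())
    rng = random.Random(seed)
    # Resample subsets with replacement and recompute the macro mean each time.
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return {
        "macro": statistics.fmean(values),
        "stdev": statistics.stdev(values),
        "ci": (lo, hi),
        "worst": min(per_subset, key=per_subset.get),
    }
```

For example, `summarize({"en-en": 0.95, "es-es": 0.93, "ar-ar": 0.88, "tr-en": 0.90})` reports the macro mean, the sample standard deviation, an approximate 95% interval, and the weakest subset. With only a few dozen subsets the interval is coarse, but it is more informative than a single point estimate.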

minor comments (1)

[Abstract] The abstract and introduction could more explicitly contrast Clark Hash with prior quantization and sketching methods to clarify its positioning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of Clark Hash and the recommendation for minor revision. We address the two major comments point by point below.

Referee: [Evaluation] Evaluation section: the reported macro Pearson correlations of 0.910 and 0.946 are given as single point estimates with no error bars, standard deviations, per-subset breakdowns, or variance estimates across the 29 subsets, and without a full description of the experimental protocol or ablation studies on individual codec components; this renders the support for the central claim that the 48-byte sketches preserve enough cosine information only moderate.

Authors: We agree that the evaluation would be strengthened by additional statistical detail. In the revised manuscript we will report standard deviations across the 29 subsets, include per-subset Pearson correlations with error bars, expand the experimental protocol description, and add ablation results on sparsity level, signing determinism, and quantization bit-width. These changes directly address the concern about moderate support for the central claim.
Revision: yes

Referee: [Method] Method section: the precise construction of the deterministic sparse signed Johnson-Lindenstrauss projection (including the exact sparsity level, how the signing is made deterministic, and the target sketch dimension) is described at a high level but lacks the concrete parameter values or pseudocode needed to reproduce the exact numerical results from the textual description alone.

Authors: We accept this observation. The revised manuscript will specify the exact sparsity (4 non-zeros per column), the deterministic signing procedure (fixed-seed hash function), the target sketch dimension (96), and will include pseudocode for the full projection-plus-quantization pipeline. These additions will enable exact reproduction from the text while retaining the existing Rust implementation as supplementary material.
Revision: yes
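The parameters named in this simulated response (4 non-zeros per column, a fixed-seed hash, sketch dimension 96) suggest a projection whose entries are recomputed on demand rather than stored. The sketch below shows one way that could work. The choice of BLAKE2b, the seed string, and the block layout that places one non-zero in each group of 24 rows are assumptions for illustration, not details taken from the paper.

```python
import hashlib

SKETCH_DIM = 96                       # target sketch dimension (from the response)
NNZ_PER_COL = 4                       # non-zeros per column (from the response)
BLOCK = SKETCH_DIM // NNZ_PER_COL     # 24 rows per block (assumed layout)
SEED = b"clark-hash-demo"             # hypothetical fixed seed

def entry(col, slot):
    """Return (row, sign) for non-zero number `slot` of input column `col`."""
    digest = hashlib.blake2b(
        col.to_bytes(4, "little") + slot.to_bytes(1, "little"),
        key=SEED,
        digest_size=8,
    ).digest()
    h = int.from_bytes(digest, "little")
    row = slot * BLOCK + (h >> 1) % BLOCK   # one row inside this slot's block
    sign = 1.0 if h & 1 else -1.0           # lowest bit picks the sign
    return row, sign

def column(col):
    """All non-zero entries of one column, derived from the hash alone."""
    return [entry(col, slot) for slot in range(NNZ_PER_COL)]
```

Confining each non-zero to its own block guarantees four distinct rows per column without rejection sampling, and nothing about the matrix has to be kept in memory or shipped alongside the data.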

Circularity Check

0 steps flagged

No significant circularity

The paper's construction consists of an explicit, deterministic sequence of operations (vector normalization, fixed sparse signed JL projection, clipping, and scalar quantization to a fixed 48-byte code) with no learned parameters, fitted values, or data-dependent calibration steps. The reported result is a direct empirical Pearson correlation between the resulting sketches and dense cosine scores on held-out STS subsets; this evaluation does not reduce to any self-referential definition, fitted-input prediction, or load-bearing self-citation. The text explicitly states that the method is not a new JL theorem and introduces no uniqueness claims derived from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper relies on the standard Johnson-Lindenstrauss lemma for distance preservation and on the arithmetic properties of floating-point normalization and clipping; no free parameters are fitted and no new entities are postulated.

axioms (1)

[standard math] The Johnson-Lindenstrauss lemma guarantees approximate preservation of inner products under the chosen sparse signed projection.
Invoked to justify that the projected and quantized codes remain useful for cosine scoring.
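For reference, the lemma in its standard form, together with the polarization step that carries it from distances to inner products of unit vectors, can be written as follows. The symbols A, k, ε, and X are notation introduced here, not the paper's. The statement concerns the existence of a suitable map, in practice a random one that succeeds with high probability, so it does not by itself bound the error of one fixed matrix followed by clipping and quantization.

```latex
% Johnson-Lindenstrauss lemma (standard form).
% For 0 < \varepsilon < 1 and any finite set X \subset \mathbb{R}^d with |X| = n,
% there is a linear map A : \mathbb{R}^d \to \mathbb{R}^k with
% k = O(\varepsilon^{-2} \log n) such that
\[
(1-\varepsilon)\,\lVert u - v\rVert_2^2
\;\le\; \lVert Au - Av\rVert_2^2
\;\le\; (1+\varepsilon)\,\lVert u - v\rVert_2^2
\qquad \text{for all } u, v \in X .
\]
% Inner products of unit vectors follow by polarization, provided A also
% preserves the squared norms of u+v and u-v to within the same factor:
\[
\langle Au, Av\rangle
= \tfrac14\bigl(\lVert A(u+v)\rVert_2^2 - \lVert A(u-v)\rVert_2^2\bigr),
\qquad
\bigl|\langle Au, Av\rangle - \langle u, v\rangle\bigr|
\le \tfrac{\varepsilon}{4}\bigl(\lVert u+v\rVert_2^2 + \lVert u-v\rVert_2^2\bigr)
= \varepsilon .
\]
```

The last equality uses the parallelogram law: for unit vectors the two squared norms sum to 4.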

pith-pipeline@v0.9.1-grok · 5730 in / 1348 out tokens · 39196 ms · 2026-06-29T13:00:16.073910+00:00 · methodology

Reference graph

Works this paper leans on

12 extracted references · 1 canonical work page · 1 internal anchor

[1] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.
[2] Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation, pages 1–14, 2017.
[3] Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In Proceedings of the 29th International Colloquium on Automata, Languages and Programming, pages 693–703, 2002.
[4] Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. A sparse Johnson-Lindenstrauss transform. In Proceedings of the 42nd ACM Symposium on Theory of Computing, pages 341–350, 2010.
[5] Robert M. Gray and David L. Neuhoff. Quantization. IEEE Transactions on Information Theory, 44(6):2325–2383, 1998.
[6] Hervé Jégou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117–128, 2011.
[7] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
[8] Daniel M. Kane and Jelani Nelson. Sparser Johnson-Lindenstrauss transforms. Journal of the ACM, 61(1):4:1–4:23, 2014.
[9] Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316, 2022.
[10] Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 3982–3992, 2019.
[11] Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Advances in Neural Information Processing Systems, volume 33, pages 5776–5788, 2020.
[12] Kilian Q. Weinberger, Anirban Dasgupta, John Langford, Alexander J. Smola, and Josh Attenberg. Feature hashing for large scale multitask learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 1113–1120, 2009.
