pith. machine review for the scientific record.

arxiv: 2605.09990 · v1 · submitted 2026-05-11 · 💻 cs.CL

Recognition: 2 theorem links · Lean Theorem

Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:07 UTC · model grok-4.3

classification 💻 cs.CL
keywords: deduplication · LLM inference · context optimization · byte-exact matching · hash set · RAG pipelines · data compression · text processing

The pith

Merlin performs byte-exact deduplication on LLM text inputs using an optimized hash set to reduce context size while preserving every original byte.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Merlin as a local deduplication engine that identifies and removes exact duplicate byte sequences from text corpora. It targets inefficiencies in retrieval-augmented generation and other data pipelines feeding large language models. A sympathetic reader would care because redundant passages inflate input sizes, increase processing costs, and slow inference without adding new information. The approach claims to deliver measurable reductions in input length while guaranteeing no data loss through deterministic matching.

Core claim

Merlin is a local-first, agnostic deduplication engine that uses a SIMD-friendly open-addressing flat hash set paired with xxHash3-64 to perform rapid byte-exact deduplication of text passages and chunks. When applied to LLM contexts such as RAG pipelines, it reduces input sizes by 13.9 percent in low-redundancy cases and by over 71 percent in high-redundancy ones, while maintaining absolute data fidelity and sustaining throughputs up to 8.7 GB/s. Integration occurs through the Model Context Protocol for secure deployment in IDEs and agents.

What carries the argument

The SIMD-friendly open-addressing flat hash set combined with xxHash3-64, which stores unique byte sequences for deterministic exact-match deduplication of text chunks.
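As a semantics-only sketch (not the authors' implementation): in Python, a set of byte strings already gives the exact-match behaviour the paper attributes to its SIMD-friendly flat hash set, so the engineering contribution is sustaining that behaviour at multi-GB/s, not the set semantics themselves.

    def dedup_chunks(chunks):
        """Byte-exact deduplication preserving first-occurrence order."""
        seen = set()
        kept = []
        for chunk in chunks:
            if chunk not in seen:   # exact byte comparison via set membership
                seen.add(chunk)
                kept.append(chunk)
        return kept

    # dedup_chunks([b"a", b"b", b"a"]) == [b"a", b"b"]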

If this is right

  • LLM input contexts can shrink by 14 to 71 percent depending on the redundancy level present in the source data.
  • Processing pipelines for retrieval and data ingestion maintain full original fidelity while running at up to 8.7 GB/s.
  • Secure zero-network local deployment becomes possible inside IDEs and autonomous agents via the Model Context Protocol.
  • The same engine applies to any text-processing workflow beyond LLMs, including general retrieval systems.
  • Redundant text no longer inflates memory and compute demands during large-model inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Wider adoption could standardize lossless context compression as a preprocessing step before model calls.
  • The same technique might extend to streaming or incremental deduplication in live data pipelines.
  • If the speed and fidelity claims hold at scale, downstream inference latency and cost could decrease in proportion to the measured input reductions.
  • Integration with existing tokenizers might allow deduplication to operate at the token level as well as the byte level.

Load-bearing premise

That the hash-set implementation actually detects and removes only true duplicate byte sequences without collisions or unintended data alteration.

What would settle it

A test case in which two non-identical byte sequences are treated as duplicates, or where the output after deduplication differs from the original input in any byte position.
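A hedged sketch of such a falsification attempt, assuming a dedup function with the first-occurrence semantics sketched above (all names here are illustrative, not from the paper):

    import os

    def settle_test(dedup, trials=10_000):
        """Fail if near-identical chunks are merged or any byte is altered."""
        for _ in range(trials):
            base = os.urandom(64)
            twin = bytes([base[0] ^ 1]) + base[1:]   # differs in exactly one bit
            kept = dedup([base, twin, base])
            assert kept == [base, twin], "false merge or byte alteration"

One caveat: random search cannot realistically surface a 64-bit hash collision (by the birthday bound, roughly 2^32, about five billion, distinct chunks are needed before a collision becomes likely), so the collision half of the question is better settled by inspecting whether the code path verifies bytes on every hash hit.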

Figures

Figures reproduced from arXiv: 2605.09990 by Sietse Schelpe.

Figure 1. Architectural commitments of the engine. Bytes flow through five stages: input ingestion, fingerprinting with a high-entropy primitive paired with a deterministic byte-verification fallback on collision, indexing into an L2-aligned memory arena with zero-allocation hot paths, lock-free deterministic dispatch across workers, and emission preserving first-occurrence order. The diagram illustrates architectu… view at source ↗
Figure 2. Per-call latency envelope on a logarithmic scale. The dedup loop's pure work falls in the 1 to 30 microsecond range. Subprocess invocation overhead (13 to 21 ms) is operating-system level, unrelated to the engine itself, and is eliminated under in-process integration. Typical inference-proxy preprocessing budgets, TTFT, and full inference-call latencies are shown for context. Approximately four orders of m… view at source ↗
Figure 3. Forest plot of all 40 primary evaluation cells. Each dot shows the per-cell quality delta (deduped minus raw) for one (vendor, benchmark) combination across four production language-model APIs and three benchmark families. The grey band marks the test-retest noise floor. All 40 cells fall within plus or minus 5 percentage points; zero cells are statistically significant after Bonferroni correction at alpha… view at source ↗
read the original abstract

Data-intensive applications, ranging from large-scale retrieval systems to advanced data pipelines, are increasingly bottlenecked by the processing of highly redundant text corpora. We present Merlin, a local-first, agnostic, high-throughput deduplication and context optimization engine designed to mitigate these inefficiencies. Utilizing a highly optimized, SIMD-friendly open-addressing flat hash set combined with xxHash3-64, Merlin performs rapid, byte-exact deduplication of text passages and data chunks. While broadly applicable to any text-processing workflow, its impact is particularly pronounced in Large Language Model (LLM) ecosystems, such as Retrieval-Augmented Generation (RAG). Our empirical evaluations demonstrate an input reduction ranging from 13.9% in low-redundancy datasets to over 71% in high-redundancy pipelines, maintaining absolute data fidelity. Furthermore, we detail the system's integration architecture via the Model Context Protocol (MCP), enabling secure, zero-network-interception deployment across major IDEs and autonomous agents. This paper outlines the core algorithmic design, performance benchmarks, and the architectural principles required to process data at sustained speeds of up to 8.7 GB/s.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript presents Merlin, a local-first deduplication engine that employs a SIMD-optimized open-addressing flat hash set keyed by xxHash3-64 to perform byte-exact deduplication of text passages and chunks. It claims input reductions ranging from 13.9% on low-redundancy datasets to over 71% on high-redundancy pipelines while preserving absolute data fidelity, with sustained throughput up to 8.7 GB/s. The work positions the system for LLM context optimization (especially RAG) and describes integration via the Model Context Protocol (MCP) for secure local deployment.

Significance. If the byte-exactness and performance claims hold after verification, Merlin could deliver practical efficiency improvements for redundant text processing in LLM inference and data pipelines by shrinking context without fidelity loss. The high reported throughput and local-first MCP integration would add value for real-world RAG and agent deployments.

major comments (2)
  1. Abstract: The claim of 'byte-exact deduplication' and 'maintaining absolute data fidelity' is load-bearing but unsupported by the given description. The design uses an open-addressing hash set with 64-bit xxHash3-64 keys; without explicit storage of full byte sequences and memcmp-style verification on every hash hit, collisions can cause distinct chunks to be incorrectly merged, violating the lossless guarantee. No such collision-handling step is mentioned.
  2. Performance benchmarks section: The abstract states concrete results (13.9%–71% reduction, 8.7 GB/s) yet provides no dataset descriptions, methodology, collision-rate analysis, or fidelity-verification procedure. This absence prevents assessment of reproducibility and directly weakens the empirical foundation for the central claims.
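For concreteness, the kind of minimal measurement the benchmarks section would need to report against is straightforward; the harness below is hypothetical (not the authors' benchmark code) and assumes chunks are already loaded as byte strings:

    import time

    def measure(dedup, chunks):
        """One dedup pass: reduction ratio and raw throughput."""
        total = sum(len(c) for c in chunks)
        t0 = time.perf_counter()
        kept = dedup(chunks)
        elapsed = time.perf_counter() - t0
        reduction = 1 - sum(len(c) for c in kept) / total
        print(f"reduction={reduction:.1%} throughput={total / elapsed / 1e9:.2f} GB/s")

A pure-Python pass can sanity-check the reduction ratios, but the 8.7 GB/s figure is only meaningful for the native engine, and any such report also needs dataset descriptions, collision-rate logging, and a byte-level diff of the reconstructed output.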

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful and constructive review of our manuscript. We appreciate the identification of areas where the description of byte-exact deduplication and the supporting benchmark details require clarification and expansion. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: Abstract: The claim of 'byte-exact deduplication' and 'maintaining absolute data fidelity' is load-bearing but unsupported by the given description. The design uses an open-addressing hash set with 64-bit xxHash3-64 keys; without explicit storage of full byte sequences and memcmp-style verification on every hash hit, collisions can cause distinct chunks to be incorrectly merged, violating the lossless guarantee. No such collision-handling step is mentioned.

    Authors: We acknowledge that the current manuscript description does not explicitly detail the collision-handling mechanism, which is necessary to rigorously support the byte-exact claim. In the implementation, the open-addressing hash set stores both the 64-bit xxHash3-64 key and the full original byte sequence for each chunk. On any hash hit, a byte-wise comparison (memcmp-style) is performed to confirm exact identity before deduplication occurs (sketched below, after these responses); only verified matches result in deduplication. This design eliminates the possibility of erroneous merging due to hash collisions. We will revise the abstract and add a dedicated subsection (with pseudocode) in the algorithmic design section to describe the storage and verification steps explicitly. revision: yes

  2. Referee: Performance benchmarks section: The abstract states concrete results (13.9%–71% reduction, 8.7 GB/s) yet provides no dataset descriptions, methodology, collision-rate analysis, or fidelity-verification procedure. This absence prevents assessment of reproducibility and directly weakens the empirical foundation for the central claims.

    Authors: We agree that the benchmarks section is currently underspecified and does not provide sufficient information for reproducibility or verification of the reported results. In the revised manuscript we will expand this section to include: (1) detailed descriptions of all datasets, their sizes, sources, and redundancy characteristics; (2) the complete experimental methodology, hardware configuration, and measurement procedures for throughput and reduction ratios; (3) observed hash collision rates together with the verification method used to confirm they did not affect results; and (4) the fidelity-verification procedure (round-trip reconstruction of the original input followed by byte-level diff against the deduplicated output). These additions will directly strengthen the empirical foundation. revision: yes
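A minimal sketch of the verify-on-hit logic described in response 1, assuming the python-xxhash binding for xxHash3-64 (the real engine stores entries in an open-addressing flat table; a dict of digest-keyed buckets is a readable stand-in with the same guarantee):

    import xxhash  # python-xxhash; xxh3_64_intdigest computes xxHash3-64

    def verified_insert(table, chunk):
        """Return True if chunk is new; never merge on hash alone."""
        key = xxhash.xxh3_64_intdigest(chunk)
        bucket = table.setdefault(key, [])
        for stored in bucket:
            if stored == chunk:    # byte-wise verification on every hash hit
                return False       # verified duplicate: safe to drop
        bucket.append(chunk)       # colliding but distinct: keep it
        return True

Emitting each chunk for which verified_insert returns True reproduces the first-occurrence output order; the byte comparison on every hit is what upgrades the 64-bit digest from a probabilistic to a deterministic equality check.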

Circularity Check

0 steps flagged

No circularity; empirical claims rest on benchmarks without self-referential derivations

full rationale

The paper describes a practical deduplication system using an open-addressing hash set with xxHash3-64 and reports measured input reductions (13.9% to 71%) and throughput (up to 8.7 GB/s) from evaluations. No equations, fitted parameters, predictions derived from inputs, self-citations as load-bearing premises, or ansatzes appear in the abstract or described content. The byte-exact fidelity claim is presented as a property of the implementation rather than a result that reduces to its own definition by construction. This is a standard engineering report whose central assertions are externally falsifiable via the reported benchmarks and do not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied systems and engineering paper. No free parameters, mathematical axioms, or invented scientific entities are introduced; design relies on standard techniques such as open-addressing hash tables and xxHash.

pith-pipeline@v0.9.0 · 5503 in / 1152 out tokens · 58212 ms · 2026-05-12T04:07:06.451355+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 4 internal anchors

  1. [1]

    K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, N. Carlini. Deduplicating Training Data Makes Language Models Better. ACL 2022. arXiv:2107.06499

  2. [2]

    N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, C. Zhang. Quantifying Memorization Across Neural Language Models. arXiv:2202.07646

  3. [3]

    M. Nasr, N. Carlini et al. Scalable Extraction of Training Data from (Production) Language Models. arXiv:2311.17035

  4. [4]

    A. Ahmed, A. F. Cooper, S. Koyejo, P. Liang. Extracting Books from Production Language Models. arXiv:2601.02671

  5. [5]

    I. Shilov, M. Meeus, Y.-A. de Montjoye. The Mosaic Memory of Large Language Models. Nature Communications 2026. arXiv:2405.15523

  6. [6]

    A. Z. Broder. On the Resemblance and Containment of Documents. SEQUENCES 1997

  7. [7]

    A. Khan et al. LSHBloom: Memory-efficient, Extreme-scale Document Deduplication. arXiv:2411.04257

  8. [8]

    A. Abbas, K. Tirumala, D. Simig, S. Ganguli, A. S. Morcos. SemDeDup: Data-efficient Learning at Web-scale through Semantic Deduplication. arXiv:2303.09540

  9. [9]

    K. Tirumala et al. D4: Improving LLM Pretraining via Document De-Duplication and Diversification. NeurIPS Datasets and Benchmarks 2023. arXiv:2308.12284

  10. [10]

    N. He et al. SoftDedup: An Efficient Data Reweighting Method for Speeding Up Language Model Pre-training. ACL 2024. arXiv:2407.06654

  11. [11]

    X. Lin, A. Ghosh, B. K. H. Low, A. Shrivastava, V. Mohan. REFRAG: Rethinking RAG-based Decoding. arXiv:2509.01092

  12. [12]

    Y. Jiang, Y. Huang, L. Cheng, C. Deng, X. Sun, L. Mai. RAGBoost: Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse. arXiv:2511.03475

  13. [13]

    Y. Liu, Z. Jia, X. Gao, K. Xu, Y. Xiong. Rethinking Soft Compression in Retrieval-Augmented Generation: A Query-Conditioned Selector Perspective. arXiv:2602.15856

  14. [14]

    C. Kummer, L. Jurkschat, M. Färber, S. Vahdati. Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference. arXiv:2604.02985

  15. [15]

    S. Udayashankar, A. Baba, S. Al-Kiswany. Accelerating Data Chunking in Deduplication Systems Using Vector Instructions (VectorCDC). USENIX FAST '25. arXiv:2508.05797

  16. [16]

    H. Ji, M. Kim, S. Oh, D. Kim, N. S. Kim. Para-ksm: Parallelized Memory Deduplication with Data Streaming Accelerator. USENIX ATC ’25

  17. [17]

    A. Levi, P. Shilane, S. Sheinvald, G. Yadgar. Physical vs. Logical Indexing with IDEA: Inverted Deduplication-Aware Index. USENIX FAST '24

  18. [18]

    Pan et al. Don't Maintain Twice, It's Alright: Merged Metadata Management in Deduplication File System with GogetaFS. USENIX FAST '25

  19. [19]

    A. Chursin, L. Kokoris-Kogias, A. Orlov, A. Sonnino, I. Zablotchi. Tidehunter: Large-Value Storage With Minimal Data Relocation. arXiv:2602.01873

  20. [20]

    Z. Wang et al. ZipLLM: Efficient LLM Storage via Model-Aware Synergistic Data Deduplication and Compression. USENIX NSDI '26. arXiv:2505.06252

  21. [21]

    J. C. Corbett et al. Spanner: Google's Globally-Distributed Database. OSDI 2012

  22. [22]

    N. Bronson et al. TAO: Facebook's Distributed Data Store for the Social Graph. USENIX ATC 2013

  23. [23]

    A. Verbitski et al. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. SIGMOD 2017

  24. [24]

    J. R. Landis, G. G. Koch. The Measurement of Observer Agreement for Categorical Data. Biometrics 33(1):159-174, 1977

  25. [25]

    G. Penedo et al. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. NeurIPS 2024. arXiv:2406.17557

    Appendix A (Data Sources and Reproducibility): All data sources are public and peer-reviewable. Benchmark scripts, dataset manifests, and harness configurations for the measurements reported in Section 4 are documented in the companio...

  26. [26]

    Run generate_dataset.py with seed = 42 to regenerate the synthetic dataset (verify SHA-256 matches the value above)

  27. [27]

    Run any byte-exact deduplication filter over the chunk multiset

  28. [28]

    Verify that the unique count equals 100,000 and the duplicate count equals 100,000. Python reference implementation (the math-equivalent reproducibility path):

        import json

        unique = set()
        total = 0
        for line in open('synthetic_dataset.jsonl'):
            item = json.loads(line)
            unique.add(item['text'])
            total += 1
        print(f"unique={len(unique)} duplicate={total - len(unique)}")

    Ex...