pith. machine review for the scientific record.

arxiv: 2605.09611 · v1 · submitted 2026-05-10 · 💻 cs.CL

Recognition: no theorem link

Byte-Exact Deduplication in Retrieval-Augmented Generation: A Three-Regime Empirical Analysis Across Public Benchmarks

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 04:14 UTC · model grok-4.3

classification 💻 cs.CL
keywords byte-exact deduplication · retrieval-augmented generation · RAG quality · context reduction · human evaluation · deduplication safety · inference efficiency · empirical analysis

The pith

Byte-exact deduplication reduces RAG context size substantially across regimes while introducing no measurable quality regression across models from four major vendors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests byte-exact deduplication of chunks in RAG pipelines to see if it preserves output quality. It measures context reduction in clean academic retrieval at 0.16 percent, in enterprise patterns at 24 percent, and in multi-turn conversations at 80 percent. A panel of five judges across vendors uses a five-category protocol to check for materially different answers before and after deduplication. The results show no quality regression, with all models meeting a strict statistical threshold for acceptable differences.
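The mechanism under test is simple to state. As an illustration only (a minimal sketch, not the paper's Merlin implementation), chunk-level byte-exact deduplication amounts to keeping the first occurrence of each byte-identical chunk and measuring how many context bytes that removes:

```python
import hashlib

def dedup_chunks(chunks: list[bytes]) -> list[bytes]:
    """Keep the first occurrence of each byte-identical chunk."""
    seen: set[bytes] = set()
    kept: list[bytes] = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).digest()  # byte-exact identity
        if digest not in seen:
            seen.add(digest)
            kept.append(chunk)
    return kept

def byte_reduction(chunks: list[bytes]) -> float:
    """Fraction of context bytes removed by deduplication."""
    total = sum(len(c) for c in chunks)
    kept = sum(len(c) for c in dedup_chunks(chunks))
    return 1.0 - kept / total if total else 0.0
```

On a toy context with one repeated chunk, e.g. `[b"aaaa", b"bbbb", b"aaaa"]`, one third of the bytes are removed; the paper's three regimes differ only in how often such byte-identical repeats occur.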

Core claim

Byte-exact deduplication at the chunk level in RAG pipelines introduces zero measurable quality regression while delivering byte reductions of 0.16 percent on clean academic data, 24.03 percent on constructed enterprise patterns, and 80.34 percent on multi-turn conversational AI. This is confirmed by a cross-vendor 5-judge panel applying a five-category human-in-the-loop protocol to panel-majority materially different (MAT) pairs: all four production APIs clear the Wilson 95 percent upper-bound threshold of 5 percent MAT pairs in both the clean and high-redundancy regimes.

What carries the argument

The byte-exact chunk-level deduplication process validated through a cross-vendor 5-judge calibrated panel and five-category human-in-the-loop noise-removal protocol for materially different pairs.

Load-bearing premise

The 5-judge cross-vendor panel combined with the five-category protocol can detect any material quality regression that byte-exact deduplication would introduce.

What would settle it

An independent evaluation using a larger judge pool or alternative protocol that identifies a significant rise in materially different outputs for any vendor after applying byte-exact deduplication.

Figures

Figures reproduced from arXiv: 2605.09611 by Sietse Schelpe.

Figure 1. Byte reduction across five measured corpora spanning the redundancy spectrum, ordered by reduction. The clean academic regime (BeIR 22.2M passages, multiplicity 1.0; rag-mini-wikipedia panel corpus, multiplicity 1.148) sits at 0.16% to 14.13% byte reduction. The constructed enterprise regime (versioned documents, Q&A boilerplate, 1,526 chunks, multiplicity 1.103) sits at 24.03%. The constructed high-redundancy panel corp…
Figure 2. Cross-vendor 5-judge calibrated panel quality validation: post-audit MAT rate after five-category human-in-the-loop noise removal (Section 4.5a). Per-vendor Wilson 95% upper confidence bound on MAT classification rate, shown for both regimes: clean (n=400 per vendor, 14.13% byte reduction) and high-redundancy (n=200 per vendor, 71.98% byte reduction). All four production vendors clear the strict 5% pre-r…
read the original abstract

This preprint presents an empirical analysis of byte-exact chunk-level deduplication in Retrieval-Augmented Generation (RAG) pipelines. We measure context reduction across three distinct operating regimes: clean academic retrieval (0.16% byte reduction on 22.2M BeIR passages), constructed enterprise patterns (24.03% reduction), and multi-turn conversational AI (80.34% reduction). To validate quality preservation, we conducted a cross-vendor 5-judge calibrated panel evaluation across four production APIs (Google Gemini 2.5 Flash, Anthropic Claude Sonnet 4.6, Meta Llama 3.3 70B, and OpenAI GPT-5.1). Applying a five-category human-in-the-loop noise-removal protocol to panel-majority materially different (MAT) pairs, we establish that byte-exact deduplication introduces zero measurable quality regression. Post-audit, all four vendors clear the strict <5% Wilson 95% upper-bound MAT threshold in both the clean and high-redundancy RAG regimes. This work demonstrates that substantial inference compute savings can be achieved deterministically without compromising evaluation-grade model quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents an empirical analysis of byte-exact chunk-level deduplication in RAG pipelines across three regimes: clean academic retrieval on 22.2M BeIR passages (0.16% byte reduction), constructed enterprise patterns (24.03% reduction), and multi-turn conversational AI (80.34% reduction). Quality preservation is assessed via a cross-vendor 5-judge calibrated human panel using a five-category noise-removal protocol on panel-majority materially different (MAT) pairs. The central claim is that deduplication introduces zero measurable quality regression, with all four production APIs (Gemini 2.5 Flash, Claude Sonnet 4.6, Llama 3.3 70B, GPT-5.1) clearing the strict <5% Wilson 95% upper-bound MAT threshold in both clean and high-redundancy regimes.

Significance. If the human evaluation protocol proves sufficiently sensitive, the work provides direct evidence that deterministic byte-exact deduplication can yield substantial context reduction and inference savings in RAG without detectable quality loss. The multi-regime design and cross-vendor panel evaluation add empirical breadth, and the use of a structured human-in-the-loop audit with a strict statistical bound is a positive step toward reproducible quality assessment in production settings.

major comments (1)
  1. [Human Evaluation / MAT Protocol] The zero-regression conclusion rests entirely on the post-audit absence of panel-majority MAT pairs under the five-category protocol. The manuscript provides no inter-rater agreement statistics (e.g., Fleiss' kappa or pairwise agreement rates), no details on the calibration procedure for the 5 judges, and no power analysis or validation that the protocol would reliably flag subtle regressions such as omission of a single critical fact. This is especially material in the clean regime (0.16% reduction), where the Wilson bound's informativeness depends on adequate sample size and detection sensitivity.
minor comments (2)
  1. [Abstract and Experimental Setup] Model version strings (e.g., 'GPT-5.1', 'Gemini 2.5 Flash') should be accompanied by exact API identifiers or dates of access to ensure reproducibility.
  2. [Regime Definitions] The three regimes are described at a high level; explicit chunking rules, overlap parameters, and how the 'constructed enterprise patterns' were generated would aid replication.
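The Wilson bound that the acceptance threshold hangs on is easy to reproduce. A minimal sketch, assuming z = 1.96 for the 95% bound and zero post-audit panel-majority MAT pairs (function names are illustrative, not the authors' code):

```python
import math

def wilson_upper(flagged: int, n: int, z: float = 1.96) -> float:
    """Wilson score upper confidence bound on a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    phat = flagged / n
    centre = phat + z * z / (2 * n)
    margin = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n))
    return (centre + margin) / (1 + z * z / n)

# Reported per-vendor sample sizes: clean regime n=400, high-redundancy n=200.
# With zero flagged pairs, both upper bounds sit well under the 5% threshold.
clean_bound = wilson_upper(0, 400)  # ~0.95%
high_bound = wilson_upper(0, 200)   # ~1.88%
```

Note the referee's point: with zero observed MAT pairs the bound is driven entirely by n, so how informative it is depends on the panel's detection sensitivity, not just the sample size.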

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and for highlighting the need for greater transparency in our human evaluation protocol. We address the major comment below and have incorporated additional details and analyses into the revised manuscript.

read point-by-point responses
  1. Referee: The zero-regression conclusion rests entirely on the post-audit absence of panel-majority MAT pairs under the five-category protocol. The manuscript provides no inter-rater agreement statistics (e.g., Fleiss' kappa or pairwise agreement rates), no details on the calibration procedure for the 5 judges, and no power analysis or validation that the protocol would reliably flag subtle regressions such as omission of a single critical fact. This is especially material in the clean regime (0.16% reduction), where the Wilson bound's informativeness depends on adequate sample size and detection sensitivity.

    Authors: We agree that these elements strengthen the credibility of the human evaluation and should have been reported explicitly. In the revised version we now include: (1) Fleiss' kappa and pairwise agreement rates computed across the five judges on the full set of audited pairs; (2) a step-by-step description of the calibration procedure, including the training materials, anchor examples, and iterative alignment sessions used to standardize the five-category noise-removal protocol; and (3) a post-hoc power justification that references the observed number of evaluated pairs, the Wilson 95% upper-bound threshold of <5%, and the protocol's design to surface material differences (including single-fact omissions) via the MAT definition. These additions are placed in a new subsection of the evaluation methodology and do not alter the original empirical findings. revision: yes
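Fleiss' kappa, which the rebuttal promises to report, can be computed directly from per-pair judge vote counts. A minimal sketch for the 5-judge binary MAT/not-MAT setting (illustrative, not the authors' analysis code):

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for a fixed panel of raters assigning each item to one category.

    counts[i][j] = number of judges placing item i in category j;
    every row must sum to the same number of raters.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_cats = len(counts[0])
    # Observed agreement per item, averaged over items.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the marginal category proportions.
    p_exp = sum(
        (sum(row[j] for row in counts) / (n_items * n_raters)) ** 2
        for j in range(n_cats)
    )
    return 1.0 if p_exp == 1.0 else (p_obs - p_exp) / (1 - p_exp)
```

Perfect agreement (e.g. `[[5, 0], [0, 5]]`) yields kappa = 1.0, while a 3-2 split on every pair yields a negative kappa; a low kappa would undercut the panel-majority MAT definition the zero-regression claim rests on.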

Circularity Check

0 steps flagged

No significant circularity: purely empirical measurements with direct observations

full rationale

The paper reports direct empirical measurements of byte reduction across three RAG regimes and a 5-judge human panel audit using a five-category MAT protocol to assess quality regression. No equations, derivations, fitted parameters, or self-referential claims appear in the abstract or described methodology. The zero-regression conclusion is presented as a post-audit observation (absence of panel-majority MAT pairs below the Wilson bound), not as a reduction of any output to its inputs by construction. No self-citations, uniqueness theorems, or ansatzes are invoked to support the core results. This is a standard empirical study whose claims rest on experimental data rather than circular logic.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard domain assumptions about deduplication and evaluation rather than new free parameters or invented entities.

axioms (2)
  • domain assumption Byte-exact deduplication removes only chunks identical at the byte level.
    Core definition used throughout the title and abstract.
  • domain assumption The five-category human-in-the-loop MAT protocol with Wilson bound accurately identifies material quality differences.
    Load-bearing for the zero-regression claim.

pith-pipeline@v0.9.0 · 5507 in / 1319 out tokens · 47804 ms · 2026-05-12T04:14:27.660350+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 5 internal anchors

  1. [1] P. Lewis et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. NeurIPS 2020. Merlin Paper Portfolio 25
  2. [2] H. Jiang et al. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. EMNLP 2023. arXiv:2310.05736
  3. [3] Z. Pan et al. LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression. ACL 2024. arXiv:2403.12968
  4. [4] H. Jiang et al. LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression. arXiv:2310.06839
  5. [5] C. Kummer, L. Jurkschat, M. Färber, S. Vahdati. Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference. arXiv:2604.02985
  6. [6] X. Lin, A. Ghosh, B. K. H. Low, A. Shrivastava, V. Mohan. REFRAG: Rethinking RAG-based Decoding. arXiv:2509.01092
  7. [7] Y. Jiang, Y. Huang, L. Cheng, C. Deng, X. Sun, L. Mai. RAGBoost: Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse. arXiv:2511.03475
  8. [8] Y. Liu, Z. Jia, X. Gao, K. Xu, Y. Xiong. Rethinking Soft Compression in Retrieval-Augmented Generation: A Query-Conditioned Selector Perspective. arXiv:2602.15856
  9. [9] K. Lee, D. Ippolito, A. Nystrom, C. Zhang, D. Eck, C. Callison-Burch, N. Carlini. Deduplicating Training Data Makes Language Models Better. ACL 2022. arXiv:2107.06499
  10. [10] N. Carlini, D. Ippolito, M. Jagielski, K. Lee, F. Tramer, C. Zhang. Quantifying Memorization Across Neural Language Models. arXiv:2202.07646
  11. [11] M. Nasr, N. Carlini et al. Scalable Extraction of Training Data from (Production) Language Models. arXiv:2311.17035
  12. [12] I. Shilov, M. Meeus, Y.-A. de Montjoye. The Mosaic Memory of Large Language Models. Nature Communications, 2026. arXiv:2405.15523
  13. [13] A. Z. Broder. On the Resemblance and Containment of Documents. SEQUENCES 1997
  14. [14] A. Khan et al. LSHBloom: Memory-efficient, Extreme-scale Document Deduplication. arXiv:2411.04257
  15. [15] A. Abbas, K. Tirumala, D. Simig, S. Ganguli, A. S. Morcos. SemDeDup: Data-efficient Learning at Web-scale through Semantic Deduplication. arXiv:2303.09540
  16. [16] K. Tirumala et al. D4: Improving LLM Pretraining via Document De-Duplication and Diversification. NeurIPS Datasets and Benchmarks 2023. arXiv:2308.12284
  17. [17] Pinecone. Training Sentence Transformers with Multiple Negatives Ranking Loss. https://www.pinecone.io/learn/series/nlp/train-sentence-transformers-multiple-negatives-ranking-loss/
  18. [18] Milvus. MinHash LSH Index Documentation. https://milvus.io/docs/minhash_lsh.md
  19. [19] Blockify. Solving RAG Accuracy Through Data Optimization. https://blockify.ai/
  20. [20] D. Lee, S.-w. Hwang, K. Lee, S. Choi, S. Park. On Complementarity Objectives for Hybrid Retrieval. ACL 2023, pages 13357-13368
  21. [21] G. Penedo et al. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale. NeurIPS 2024. arXiv:2406.17557
  22. [22] European Union. Regulation 2024/1689: Artificial Intelligence Act, Article 12. Merlin Paper Portfolio 26
  23. [23] U.S. Securities and Exchange Commission. 17 CFR § 240.17a-4: Records to be preserved by certain exchange members, brokers and dealers
  24. [24] A. Sellergren et al. MedGemma Technical Report. arXiv:2507.05201
  25. [25] A. Bondarenko, A. Viehweger. LLM Robustness Against Misinformation in Biomedical Question Answering. arXiv:2410.21330
  26. [26] P. Islam et al. FinanceBench: A New Benchmark for Financial Question Answering. arXiv:2311.11944
  27. [27] N. Guha et al. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models. NeurIPS 2023 Datasets and Benchmarks. arXiv:2308.11462
  28. [28] J. R. Landis, G. G. Koch. The Measurement of Observer Agreement for Categorical Data. Biometrics 33(1):159-174, 1977
  29. [29] Sietse Schelpe. Merlin: Deterministic Byte-Exact Deduplication for Lossless Context Optimization in Large Language Model Inference. Companion paper, arXiv ID pending. Appendix A: Run Identifiers and Reproducibility. Per-call telemetry is archived under run identifiers fixing the precise version of each benchmark. The runs referenced in Section 4 are dated...