arxiv: 2605.24541 · v1 · pith:3HNIKVINnew · submitted 2026-05-23 · 💻 cs.LG · cs.AI · cs.CL · cs.IR

SemanticZip: A Pilot Framework for Lossy Text Compression with LLMs as Semantic Decompressors

Pith reviewed 2026-06-30 14:41 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.IR
keywords semantic compression · lossy compression · LLM decompression · text codecs · semantic atoms · protected packets · recoverability · token gain

The pith

SemanticZip introduces lossy text compression where LLMs decompress compact codes to recover task-relevant semantic commitments while protecting exact parts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SemanticZip as a pilot framework for compressing text into compact codes that an LLM can later expand back into meaningful content for a given task. Unlike standard compression or summarization, it evaluates recovery of semantic atoms using an independent decoder LLM and distinguishes between protected exact text and lossy semantic packets. Six representation regimes are tested on five diagnostic cases, showing varying levels of recall and token savings, with structured prose performing best on recovery and SemanticZip ASCII on compression. The authors emphasize that this is an experimental interface rather than a performance benchmark, highlighting a principle for deciding what to compress.

Core claim

The central claim is that LLM-mediated decompression can be formalized using a protected/lossy packet architecture, allowing evaluation of recoverability across representation regimes, and that the viable approach is to keep safety-critical commitments protected while semantically zipping predictable low-risk context.

What carries the argument

The protected/lossy packet architecture that separates exact safety-critical text from compressible semantic codes decompressed by an LLM.
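
The split described above can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: the `Packet` class, the `decompress` function, the `CODEBOOK` table, and the example strings are all hypothetical, and a lookup table stands in for the decoder LLM so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Packet:
    kind: str      # "protected" (kept verbatim) or "lossy" (compact semantic code)
    payload: str


def decompress(packets: List[Packet], expand: Callable[[str], str]) -> str:
    """Reassemble text: protected payloads pass through unchanged,
    lossy payloads are expanded by the supplied decoder."""
    parts = []
    for packet in packets:
        if packet.kind == "protected":
            parts.append(packet.payload)
        elif packet.kind == "lossy":
            parts.append(expand(packet.payload))
        else:
            raise ValueError(f"unknown packet kind: {packet.kind!r}")
    return " ".join(parts)


# Hypothetical stand-in for the decoder LLM: a fixed lookup table.
CODEBOOK = {"ctx:std-followup": "The patient should return for a routine follow-up."}


def toy_expand(code: str) -> str:
    # Unknown codes are returned as-is rather than guessed.
    return CODEBOOK.get(code, code)


packets = [
    Packet("lossy", "ctx:std-followup"),
    Packet("protected", "Do NOT exceed 5 mg per day."),
]
restored = decompress(packets, toy_expand)
print(restored)
```

The point of the design is that the decoder never touches a protected payload, so any reconstruction error is confined to the lossy packets.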

If this is right

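  • Task-relevant semantic commitments can be recovered with weighted atom recall above 0.8 in at least three of the six tested regimes (structured prose 0.956, CCL-Min 0.874, SemanticZip ASCII 0.802).
  • Reported token gains range from 19.1% (structured prose) to 46.5% (SemanticZip ASCII), depending on the representation chosen.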
  • Safety-critical text remains unchanged while low-risk context is replaced by compact codes.
  • The framework provides metrics like Critical Atom Recall and tokenizer gain for comparing different codes.
  • Different formats such as prose, JSON, and emoji yield distinct trade-offs between compression and fidelity.
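
The metrics named in this list can be illustrated with a short sketch. The page does not give the paper's formulas, so the definitions below are assumptions: recall over critical atoms only, weight-normalized recall over all atoms, set-based precision, and token gain as the fractional reduction in token count. The atom names, weights, and token counts are invented for illustration.

```python
from typing import Dict, Set, Tuple

# Reference atoms: id -> (weight, is_critical). Values are illustrative only.
Reference = Dict[str, Tuple[float, bool]]


def critical_atom_recall(ref: Reference, recovered: Set[str]) -> float:
    """Fraction of critical reference atoms that the decoder recovered."""
    critical = {atom for atom, (_, is_critical) in ref.items() if is_critical}
    if not critical:
        return 1.0
    return len(critical & recovered) / len(critical)


def weighted_atom_recall(ref: Reference, recovered: Set[str]) -> float:
    """Recovered weight divided by total reference weight."""
    total = sum(weight for weight, _ in ref.values())
    hit = sum(weight for atom, (weight, _) in ref.items() if atom in recovered)
    return hit / total if total else 1.0


def precision(ref: Reference, recovered: Set[str]) -> float:
    """Fraction of recovered atoms that appear in the reference (0.0 if none recovered)."""
    if not recovered:
        return 0.0
    return len(recovered & set(ref)) / len(recovered)


def token_gain(original_tokens: int, compressed_tokens: int) -> float:
    """Assumed definition: fractional reduction in token count."""
    return 1.0 - compressed_tokens / original_tokens


ref = {
    "dose_limit": (3.0, True),
    "no_alcohol": (3.0, True),
    "followup": (1.0, False),
    "greeting": (1.0, False),
}
recovered = {"dose_limit", "no_alcohol", "followup", "invented_fact"}

print(critical_atom_recall(ref, recovered))  # both critical atoms recovered
print(weighted_atom_recall(ref, recovered))  # 7.0 of 8.0 weight recovered
print(precision(ref, recovered))             # 3 of 4 recovered atoms are in the reference
print(token_gain(200, 120))
```

Under these assumed definitions, a decoder can score perfectly on critical recall while still losing weighted recall on low-weight atoms and losing precision to hallucinated atoms, which is why the metrics are reported separately.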

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying the protected/lossy split to real documents like reports or logs could reduce storage while maintaining key facts.
  • Future tests on diverse, non-author-constructed texts would check if the observed recoverability holds more broadly.
  • The method might integrate with retrieval systems where only semantic summaries are stored for context.
  • Risk assessment models could automatically decide which parts of text to protect versus zip.
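
To make the last extension concrete, here is a deliberately crude sketch of a router that decides, sentence by sentence, what to protect and what to zip. It is an editorial illustration, not something the paper proposes: the `RISK_PATTERN` heuristic (digits, negations, and obligation words) and the `route_sentences` function are hypothetical, and a real risk model would need far more than a regular expression.

```python
import re
from typing import List, Tuple

# Hypothetical risk cue: any digit, or a negation/obligation word.
RISK_PATTERN = re.compile(r"\d|\b(?:not|never|must|only|no)\b", re.IGNORECASE)


def route_sentences(text: str) -> List[Tuple[str, str]]:
    """Label each sentence 'protected' (keep verbatim) or 'lossy' (safe to zip)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        ("protected" if RISK_PATTERN.search(sentence) else "lossy", sentence)
        for sentence in sentences
    ]


sample = (
    "Thanks for joining the call today. "
    "The API key must never be logged. "
    "We discussed the general roadmap. "
    "Rollout starts on 2026-07-01."
)
routed = route_sentences(sample)
for label, sentence in routed:
    print(f"{label:9s} {sentence}")
```

A rule like this errs toward over-protecting, which costs compression but not safety; the opposite error is the one the protected/lossy split is meant to prevent.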

Load-bearing premise

The five author-constructed diagnostic cases are representative enough to evaluate recoverability of task-relevant semantic commitments across the six representation regimes.

What would settle it

If evaluations on a broader set of independently created documents show that no representation regime achieves weighted atom recall above 0.7 while providing meaningful token savings, the practical utility of the framework would be called into question.
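
The criterion above can be written as a simple predicate. This is a sketch under stated assumptions: the function name is hypothetical, and because "meaningful token savings" is not quantified on this page, the 10% `min_gain` default is an arbitrary placeholder. The pilot figures are the three regimes whose numbers the abstract reports; the `broader` results are invented purely to exercise the check.

```python
from typing import Dict, Tuple


def utility_in_question(
    results: Dict[str, Tuple[float, float]],
    war_floor: float = 0.7,
    min_gain: float = 0.10,  # assumed threshold for "meaningful" savings
) -> bool:
    """True if NO regime exceeds war_floor while reaching at least min_gain.

    results maps regime name -> (weighted atom recall, token gain).
    """
    return not any(
        war > war_floor and gain >= min_gain for war, gain in results.values()
    )


# (WAR, token gain) as reported in the abstract for three regimes.
pilot = {
    "structured prose": (0.956, 0.191),
    "CCL-Min": (0.874, 0.394),
    "SemanticZip ASCII": (0.802, 0.465),
}

# Invented numbers for a hypothetical broader evaluation.
broader = {
    "structured prose": (0.68, 0.15),
    "CCL-Min": (0.61, 0.35),
    "SemanticZip ASCII": (0.74, 0.04),
}

print(utility_in_question(pilot))    # pilot regimes clear both thresholds
print(utility_in_question(broader))  # no invented regime clears both at once
```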

Figures

Figures reproduced from arXiv: 2605.24541 by Natalia Trukhina, Vadim Vashkelis.

Figure 1: Pilot compression–recoverability tradeoff over five author-constructed cases. SemanticZip ASCII …

Original abstract

Text compression for large language model (LLM) systems is usually framed as token deletion, retrieval, summarization, or exact reconstruction. We study a more aggressive but explicitly lossy setting: compress text into compact codes that an LLM can expand into task-relevant meaning. We call this setting SemanticZip. Unlike lossless compression, SemanticZip does not require byte-identical reconstruction; unlike ordinary summarization, it treats model-based decompression as part of the codec and evaluates whether task-relevant semantic commitments are recovered. This paper is a pilot framework, not a benchmark claim. We formalize LLM-mediated decompression, define a protected/lossy packet architecture, and evaluate six representation regimes over five author-constructed diagnostic cases: structured prose, JSON, CCL-Core, CCL-Min, SemanticZip ASCII, and SemanticZip emoji. An independent decoder LLM reconstructs typed semantic atoms from each compressed representation, and we score Critical Atom Recall, Weighted Atom Recall, precision, and tokenizer gain. In this pilot, structured prose has the highest recoverability, with WAR = 0.956 and 19.1% o200k_base token gain. CCL-Min is the strongest balanced point, with 39.4% token gain and WAR = 0.874. SemanticZip ASCII provides the largest useful compression, with 46.5% token gain and WAR = 0.802, while emoji-heavy SemanticZip performs worse on both compression and recovery. The main contribution is not the claim that these numbers establish a universal frontier. Rather, we introduce a reproducible experimental interface for studying lossy, LLM-decompressible text codes and a design principle: safety-critical and exact commitments should remain protected, while predictable low-risk context may be semantically zipped.
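
One way to read the three reported operating points is as a tradeoff frontier. The sketch below uses only the figures given in the abstract (token gain and WAR for structured prose, CCL-Min, and SemanticZip ASCII) and checks whether any point dominates another on both axes. The `dominates` and `pareto_front` helpers are illustrative names, not part of the paper, and the JSON, CCL-Core, and emoji regimes are omitted because the abstract does not give their numbers.

```python
from typing import Dict, List, Tuple

# regime -> (token gain, weighted atom recall), as reported in the abstract.
REPORTED: Dict[str, Tuple[float, float]] = {
    "structured prose": (0.191, 0.956),
    "CCL-Min": (0.394, 0.874),
    "SemanticZip ASCII": (0.465, 0.802),
}


def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """a dominates b if it is at least as good on both axes and not identical."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b


def pareto_front(points: Dict[str, Tuple[float, float]]) -> List[str]:
    """Names of regimes that no other regime dominates."""
    return sorted(
        name
        for name, point in points.items()
        if not any(dominates(other, point) for other in points.values())
    )


front = pareto_front(REPORTED)
print(front)
```

Among these three points none dominates another: each step toward higher token gain is paid for with lower recall, which is consistent with the abstract presenting them as distinct tradeoff choices rather than a ranking.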

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims to introduce SemanticZip, a pilot framework for lossy text compression where LLMs serve as semantic decompressors. It formalizes the decompression process, defines a protected/lossy packet architecture, and evaluates six representation regimes (structured prose, JSON, CCL-Core, CCL-Min, SemanticZip ASCII, SemanticZip emoji) across five author-constructed diagnostic cases. Metrics reported include Critical Atom Recall, Weighted Atom Recall (WAR), precision, and tokenizer gain. Structured prose achieves the highest WAR of 0.956 with 19.1% token gain, while SemanticZip ASCII offers 46.5% gain with WAR 0.802. The primary contribution is positioned as a reproducible experimental interface and the design principle that safety-critical commitments should be protected while low-risk context can be semantically zipped.

Significance. If the framework holds, it offers a novel approach to studying lossy compression tailored for LLM systems by treating decompression as part of the codec. This could be significant for developing efficient context management strategies in LLM applications. The work explicitly frames itself as a pilot without claiming generalizability, which strengthens its position. Credit is given for providing a reproducible experimental interface and a clear design principle separating protected and lossy packets. The evaluation on diagnostic cases demonstrates the concept without overclaiming.

minor comments (2)
  1. [Abstract] The phrase '19.1% o200k_base token gain' is unclear on first reading; state that o200k_base names a tokenizer encoding (the tiktoken encoding used by GPT-4o models) rather than a base model.
  2. Ensure acronyms such as CCL-Core and CCL-Min are expanded on first use in the main body, as they are not defined in the abstract.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the work as a pilot framework, for crediting the reproducible experimental interface and the protected/lossy packet design principle, and for recommending minor revision. No specific major comments were listed in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a pilot framework that introduces a reproducible experimental interface and a protected/lossy packet design principle. It evaluates six representation regimes on five author-constructed diagnostic cases using explicitly defined metrics (Critical Atom Recall, WAR, etc.) without equations, derivations, fitted-parameter predictions, or self-citations. The contribution is scoped to the interface and principle rather than any generalization claim, so no load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The paper is a pilot framework that introduces new terminology and an experimental setup without mathematical derivations, fitted parameters, or standard axioms beyond the basic experimental design.

invented entities (1)
  • protected/lossy packet architecture · no independent evidence
    purpose: To separate safety-critical exact commitments from compressible low-risk context in the compression scheme
    Newly defined as part of the framework with no independent evidence or external validation provided in the abstract.

pith-pipeline@v0.9.1-grok · 5860 in / 1223 out tokens · 66427 ms · 2026-06-30T14:41:50.914290+00:00 · methodology

