pith. machine review for the scientific record.

arxiv: 2605.04426 · v1 · submitted 2026-05-06 · 💻 cs.CL

Recognition: unknown

Telegraph English: Semantic Prompt Compression via Structured Symbolic Rewriting

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:37 UTC · model grok-4.3

classification 💻 cs.CL
keywords prompt compression · semantic rewriting · symbolic representation · token reduction · long context · fact decomposition · structured prompts · atomic facts

The pith

Rewriting prompts into atomic facts with logical symbols cuts tokens by half while retaining 99.1 percent accuracy on key facts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that natural language can be compressed by rewriting it into Telegraph English, a structured dialect that breaks input into atomic fact lines and replaces phrases with about forty logical and relational symbols. This rewrite adapts its compression ratio to the document's information density and makes each output line a self-contained fact that also serves as an index entry. A reader would care because the approach delivers higher downstream accuracy than token-deletion compressors at the same length, with the largest gains appearing on smaller models that otherwise lose fine details. Evaluation on more than four thousand long-context question-answer pairs supports these outcomes across five different language models.

Core claim

Telegraph English decomposes the input into atomic fact lines and substitutes verbose natural-language phrases with a fixed inventory of logical and relational symbols. The resulting line structure causes compression and semantic chunking to become the same operation, so the compressed form is simultaneously a compact prompt and an addressable index of facts. At roughly 50 percent token reduction this representation preserves 99.1 percent accuracy on key facts with GPT-4.1 and exceeds the performance of LLMLingua-2 at matched ratios on every model and task examined, with the margin growing to as much as 11 points on fine-detail questions when smaller models are used.

What carries the argument

Atomic-fact line structure combined with substitution by a fixed set of roughly forty logical and relational symbols, which simultaneously reduces length and creates an independent semantic index.
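As an illustration, the substitution mechanism can be sketched in a few lines of Python. The symbol table and example facts below are hypothetical; the paper's actual ~40-symbol inventory lives in the released grammar specification.

```python
# Illustrative sketch of a TE-style rewrite: replace verbose relational
# phrases with compact symbols, emit one atomic fact per line.
# The symbol inventory here is invented for demonstration.
SYMBOLS = {
    " is greater than ": " > ",
    " leads to ": " → ",
    " and also ": " ∧ ",
    " is a member of ": " ∈ ",
    " therefore ": " ∴ ",
}

def telegraph_rewrite(facts: list[str]) -> str:
    """Compress a list of atomic facts into symbol-substituted lines."""
    lines = []
    for fact in facts:
        for phrase, symbol in SYMBOLS.items():
            fact = fact.replace(phrase, symbol)
        lines.append(fact)
    # One fact per line: the output doubles as prompt and index.
    return "\n".join(lines)

compressed = telegraph_rewrite([
    "revenue is greater than cost",
    "demand leads to higher prices",
])
```

Because substitution and line decomposition happen together, shortening the text and chunking it into addressable facts are one pass, which is the property the claim above turns on.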

If this is right

  • At 50 percent token reduction TE retains 99.1 percent accuracy on key facts with GPT-4.1.
  • TE exceeds LLMLingua-2 accuracy at every matched compression ratio across all five models and both difficulty levels.
  • The accuracy advantage widens on smaller models, reaching 11 percentage points on tasks that require fine detail.
  • Each compressed line functions as an independently addressable fact, turning the output into a semantic index as a direct byproduct.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pre-structuring prompts in this symbolic form could let capacity-limited models handle longer contexts by off-loading relational parsing to the rewrite step.
  • The line-index property would allow a system to retrieve or update individual facts inside a compressed prompt without re-expanding the entire text.
  • The same decomposition grammar might be applied in reverse to expand selected facts back into natural language for human review.
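The second bullet can be made concrete with a hypothetical sketch. Keyword matching stands in for whatever addressing scheme a real system would use, and the fact lines are invented for illustration, not drawn from the paper.

```python
# Hypothetical sketch of the line-index property: each compressed line is
# a self-contained fact, so a system can fetch or patch individual facts
# without re-expanding the whole prompt.
def get_facts(compressed: str, keyword: str) -> list[str]:
    """Retrieve the fact lines that mention a keyword."""
    return [line for line in compressed.splitlines() if keyword in line]

def update_fact(compressed: str, index: int, new_fact: str) -> str:
    """Replace a single fact line in place, leaving the rest untouched."""
    lines = compressed.splitlines()
    lines[index] = new_fact
    return "\n".join(lines)

prompt = "Paris ∈ France\npopulation(Paris) > 2M\ncapital(France) = Paris"
matched = get_facts(prompt, "population")
patched = update_fact(prompt, 1, "population(Paris) > 2.1M")
```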

Load-bearing premise

The rewriting step can break natural language into atomic facts and replace phrases with symbols without introducing errors or ambiguities that the downstream model cannot resolve correctly.

What would settle it

A replication on a fresh collection of long documents in which the Telegraph-English versions produce accuracy below 95 percent or fall below the deletion baseline at the same token count would falsify the performance claim.

Figures

Figures reproduced from arXiv: 2605.04426 by Alexey A. Shvets, Dmitri Kalaev, Mikhail L. Arbuzov, Sisong Bei, Ziwei Dong.

Figure 1. Distribution of compression ratios across 4,081 LongBench-v2 chunks compressed with TE (o4-mini, tiktoken cl100k_base). Mean 0.585, range 0.13–1.57. The right tail above 1.0 corresponds to short, dense inputs that expand under TE.
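The per-chunk statistic behind Figure 1 is easy to reproduce in miniature. Whitespace token counts below are a crude stand-in for the tiktoken cl100k_base tokenizer the paper uses, and the example pairs are invented.

```python
# Sketch of the compression-ratio statistic from Figure 1:
# ratio = compressed tokens / original tokens, computed per chunk.
def compression_ratio(original: str, compressed: str) -> float:
    # Whitespace splitting is a rough proxy for a real tokenizer.
    return len(compressed.split()) / len(original.split())

pairs = [
    # Verbose prose compresses well (ratio << 1.0).
    ("the revenue of the firm is greater than its cost", "revenue > cost"),
    # Short, dense input can expand under rewriting (ratio > 1.0),
    # which is what produces the right tail in Figure 1.
    ("E=mc2", "E = m × c ** 2"),
]
ratios = [compression_ratio(o, c) for o, c in pairs]
mean_ratio = sum(ratios) / len(ratios)
```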
Original abstract

We introduce Telegraph English (TE), a prompt-compression protocol that rewrites natural language into a symbol-rich, formally-structured dialect. Where token-deletion methods such as LLMLingua-2 train a classifier to delete low-importance tokens at a fixed ratio, TE performs a full semantic rewrite: it decomposes the input into atomic fact lines, substitutes verbose phrases with ~40 logical and relational symbols, and lets the compression ratio adapt to each document's information density. A consequence of the line-structure rule is that compression and semantic chunking become the same operation -- each output line is an independently addressable fact, so the compressed representation is simultaneously a semantic index. We evaluate TE on 4,081 question-answer pairs from LongBench-v2 across five OpenAI models and two difficulty levels. At roughly 50% token reduction, TE preserves 99.1% accuracy on key facts with GPT-4.1 and outperforms LLMLingua-2 at matched compression ratios on every model and task tested. The gap widens on smaller models -- up to 11 percentage points on fine-detail tasks -- suggesting that explicit relational structure compensates for limited model capacity. We release the grammar specification, compression prompt, benchmark data, and reference implementation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Telegraph English (TE), a prompt-compression protocol that rewrites natural language into a symbol-rich, formally-structured dialect. It decomposes input into atomic fact lines, substitutes verbose phrases with ~40 logical and relational symbols, and adapts the compression ratio to each document's information density. The line-structure rule makes compression and semantic chunking the same operation. Evaluation on 4,081 QA pairs from LongBench-v2 across five OpenAI models and two difficulty levels reports that at ~50% token reduction, TE preserves 99.1% accuracy on key facts with GPT-4.1 and outperforms LLMLingua-2 at matched ratios on every model and task, with larger gaps (up to 11 points) on smaller models for fine-detail tasks. The grammar, prompt, benchmark data, and reference implementation are released.

Significance. If the results hold, the work shows that explicit symbolic structure in compression can preserve semantic fidelity better than token-deletion baselines while also enabling semantic indexing, with particular benefits for capacity-limited models. The public release of grammar specification, compression prompt, benchmark data, and reference implementation is a clear strength that supports reproducibility and extension.

major comments (3)
  1. [Abstract] Abstract and evaluation section: The central claim of 99.1% key-fact accuracy preservation at ~50% reduction (and outperformance over LLMLingua-2) presupposes that the LLM-driven atomic-fact decomposition and symbol substitution are semantically faithful. No direct fidelity audit (human judgment or automated entailment check of original vs. TE facts on held-out data) is reported, so systematic omissions or ambiguities could be invisible in the LongBench-v2 QA metric if the downstream model answers the specific questions correctly anyway.
  2. [Evaluation] Evaluation section: The performance gaps (including the 11-point advantage on smaller models) are presented without statistical details such as standard deviations across runs, confidence intervals, or significance tests. This weakens the claim that explicit relational structure compensates for limited model capacity, as the numerical differences cannot be assessed for robustness.
  3. [Methods] Methods section: The rewriting rules and symbol inventory are described at a high level but the manuscript does not provide error analysis of the compression step itself (e.g., rate of introduced ambiguities or fact omissions). Without this, the attribution of gains specifically to the structured symbolic approach rather than incidental preservation remains under-supported.
minor comments (2)
  1. [Abstract] The abstract states 'five OpenAI models' without naming them; the main text should list the exact models (e.g., GPT-4.1, etc.) for immediate clarity.
  2. [Methods] A summary table of the ~40 symbols and their meanings in the main text would allow readers to understand the dialect without immediately consulting the released artifacts.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract and evaluation section: The central claim of 99.1% key-fact accuracy preservation at ~50% reduction (and outperformance over LLMLingua-2) presupposes that the LLM-driven atomic-fact decomposition and symbol substitution are semantically faithful. No direct fidelity audit (human judgment or automated entailment check of original vs. TE facts on held-out data) is reported, so systematic omissions or ambiguities could be invisible in the LongBench-v2 QA metric if the downstream model answers the specific questions correctly anyway.

    Authors: We agree that a direct fidelity audit would strengthen the central claim. The LongBench-v2 QA pairs are constructed to probe retention of key facts from the source documents, and the 99.1% accuracy with GPT-4.1 provides strong indirect evidence that the TE rewrite preserves the information needed for correct answers. To address the possibility of invisible omissions, we will add to the revised manuscript a human fidelity audit performed on a held-out set of 100 documents. Two annotators will independently compare each original document against its TE version and rate fact completeness, introduced ambiguities, and any semantic drift. Inter-annotator agreement and error rates will be reported. The released reference implementation also enables future automated entailment checks. revision: yes
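The agreement statistic such an audit would report can be sketched as Cohen's kappa over two annotators' labels. The label set and ratings below are assumptions for illustration, not the authors' rubric.

```python
# Sketch of the proposed fidelity audit's agreement metric: Cohen's kappa
# for two annotators rating whether each TE rewrite preserves the source.
def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    # Observed agreement: fraction of items both annotators label alike.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independent labeling.
    labels = set(a) | set(b)
    pe = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (po - pe) / (1 - pe)

# Illustrative ratings over six documents (label set is hypothetical).
ann1 = ["ok", "ok", "drift", "ok", "omission", "ok"]
ann2 = ["ok", "ok", "drift", "ok", "ok", "ok"]
kappa = cohens_kappa(ann1, ann2)
```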

  2. Referee: [Evaluation] Evaluation section: The performance gaps (including the 11-point advantage on smaller models) are presented without statistical details such as standard deviations across runs, confidence intervals, or significance tests. This weakens the claim that explicit relational structure compensates for limited model capacity, as the numerical differences cannot be assessed for robustness.

    Authors: The referee is correct that statistical details are missing. All reported numbers come from single runs per model-task-compression combination, performed under fixed API settings to control cost. The advantage of TE is nevertheless consistent: it outperforms LLMLingua-2 on every one of the five models and both difficulty levels, with the gap largest on smaller models for fine-detail tasks. In the revision we will add an explicit limitations paragraph noting the absence of variance estimates and will include results from three independent compressions on a 20% subset of the benchmark to give readers a sense of stability. Full multi-run statistics across the entire benchmark would require substantial additional API expenditure that was not budgeted in the original study. revision: partial
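One inexpensive complement to the promised multi-run subset would be a bootstrap confidence interval over the existing per-question 0/1 grades, since resampling needs no further API calls. The sketch below uses toy outcomes, not the paper's data.

```python
import random

# Bootstrap CI for mean accuracy over binary graded answers: resample the
# per-question 0/1 outcomes with replacement and read off the quantiles.
def bootstrap_ci(correct: list[int], n_boot: int = 10_000, seed: int = 0):
    """Return a 95% bootstrap confidence interval for mean accuracy."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_boot)
    )
    return means[int(0.025 * n_boot)], means[int(0.975 * n_boot)]

# Toy stand-in for the 4,081 graded answers: 90% accuracy on 100 items.
outcomes = [1] * 90 + [0] * 10
lo, hi = bootstrap_ci(outcomes)
```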

  3. Referee: [Methods] Methods section: The rewriting rules and symbol inventory are described at a high level but the manuscript does not provide error analysis of the compression step itself (e.g., rate of introduced ambiguities or fact omissions). Without this, the attribution of gains specifically to the structured symbolic approach rather than incidental preservation remains under-supported.

    Authors: We accept that a quantitative error analysis of the compression step itself would better support attribution to the symbolic structure. We will expand the Methods section with a new subsection reporting manual inspection of 150 randomly sampled compressed documents. Errors will be categorized into fact omission, introduced ambiguity, incorrect symbol application, and over-compression. Rates will be compared against a token-deletion baseline on the same samples. These results will be used to discuss how the line-structured symbolic rewrite reduces the specific error types that token-deletion methods tend to produce. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes Telegraph English as a semantic rewrite protocol and reports empirical results on the external LongBench-v2 benchmark (4,081 QA pairs). The headline metrics (99.1% key-fact accuracy at ~50% token reduction, outperformance vs. LLMLingua-2) are computed from standard downstream QA accuracy on held-out questions across independent models; they do not reduce to any internal definition, fitted parameter, or self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked that would make the reported performance equivalent to the method's own inputs by construction. The evaluation uses public data and a published baseline, satisfying the criterion for an independent, non-circular empirical claim.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The method rests on a custom grammar and symbol inventory chosen by the authors plus the assumption that natural language decomposes cleanly into independent atomic facts.

free parameters (1)
  • symbol inventory = ~40
    The set of approximately 40 logical and relational symbols selected for substitution.
axioms (1)
  • domain assumption: Natural language can be decomposed into independent atomic facts without significant semantic loss.
    Invoked to justify the line-structure rule and compression-as-chunking property.
invented entities (1)
  • Telegraph English dialect (no independent evidence)
    purpose: Structured symbolic representation for prompt compression and semantic indexing
    New protocol defined by the grammar and rewriting rules introduced in the paper.

pith-pipeline@v0.9.0 · 5540 in / 1271 out tokens · 65245 ms · 2026-05-08T17:37:45.741819+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith reviews without signing in.

Reference graph

Works this paper leans on

12 extracted references · 3 canonical work pages · 2 internal anchors

  1. LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks. arXiv:2412.15204, 2025.
  2. Adapting Language Models to Compress Contexts. Proceedings of EMNLP.
  3. Attempto Controlled English for Knowledge Representation. Reasoning Web.
  4. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997.
  5. Hao, Yaru and others. Structured Prompting: Scaling In-Context Learning to 1,000 Examples.
  6. LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models. Proceedings of EMNLP.
  7. Lewis, Patrick and others. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
  8. MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560.
  9. Pan, Zhuoshi; Wu, Qianhui; Jiang, Huiqiang; Xia, Menglin; Luo, Xufang; Zhang, Jue; Lin, Qingwei; Ruhle, Victor; Yang, Yuqing; Lin, Chin-Yew; Zhao, H. Vicky; Qiu, Lili; Wang, Chuanli.
  10. Reflexion: Language Agents with Verbal Reinforcement Learning. Advances in Neural Information Processing Systems.
  11. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems.
  12. Xu, Fangyuan; Shi, Weijia; Choi, Eunsol.