pith · machine review for the scientific record

arxiv: 2604.13066 · v1 · submitted 2026-03-19 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links

· Lean Theorem

Lossless Prompt Compression via Dictionary-Encoding and In-Context Learning: Enabling Cost-Effective LLM Analysis of Repetitive Data

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 07:53 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords prompt compression · dictionary encoding · in-context learning · LLM analysis · lossless compression · repetitive data · cost reduction · log analysis

The pith

LLMs learn encoding keys from a system prompt dictionary and analyze compressed repetitive data directly, matching uncompressed results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that LLMs can interpret meta-tokens in compressed prompts when the mapping dictionary is supplied in the system prompt, enabling direct analysis without decompression or model changes. This supports a dictionary-encoding method that detects repetitive subsequences at multiple scales and replaces them with compact tokens, provided the swap yields net token savings. Compression ratios reach up to 80 percent on suitable datasets. Validation treats accurate decompression as a proxy for preserved analytical quality, yielding exact match rates above 0.99 on template-based cases and Levenshtein similarity above 0.91 on algorithmic cases even at 60-80 percent compression. Output similarity depends primarily on dataset repetition patterns rather than compression depth.

Core claim

The paper establishes that LLMs, when supplied with a compression dictionary in the system prompt, correctly interpret meta-tokens and perform analysis on the encoded representation, producing outputs equivalent to those obtained from the original uncompressed inputs. This holds for a multi-scale dictionary-encoding algorithm that identifies repetitive subsequences and substitutes them with compact meta-tokens under a token-savings optimization rule. Experiments on LogHub 2.0 using Claude 3.7 Sonnet confirm equivalence via a decompression proxy task, with exact matches exceeding 0.99 for template compression and average similarity above 0.91 for algorithmic compression at 60-80 percent rates.

What carries the argument

Dictionary encoding that replaces frequent subsequences with meta-tokens, with the mapping provided in the system prompt so the LLM learns the encoding in-context and operates directly on the compressed sequence.
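The mechanism can be sketched in a few lines. This is an illustrative toy, not the paper's algorithm: `compress`, `build_system_prompt`, and the meta-token format are names we invent here, the sketch operates at a single length scale rather than the paper's multiple scales, and it counts characters where the paper counts tokens.

```python
# Toy sketch of dictionary encoding for prompt compression.
# All names (compress, build_system_prompt, META_FMT) are illustrative,
# not taken from the paper.
from collections import Counter

META_FMT = "<D{}>"  # compact meta-token placeholder

def compress(text: str, pat_len: int = 20, min_count: int = 3):
    """Greedily replace frequent fixed-length substrings with meta-tokens.
    Assumes meta-token strings do not occur in the original text."""
    dictionary = {}
    compressed = text
    # count all substrings at one length scale
    counts = Counter(text[i:i + pat_len] for i in range(len(text) - pat_len + 1))
    for pattern, n in counts.most_common():
        if n < min_count:
            break
        token = META_FMT.format(len(dictionary))
        # substitute only if the swap yields net savings, charging the
        # dictionary entry's own overhead against the gain
        savings = n * (len(pattern) - len(token)) - (len(token) + len(pattern))
        if savings > 0 and pattern in compressed:
            compressed = compressed.replace(pattern, token)
            dictionary[token] = pattern
    return compressed, dictionary

def build_system_prompt(dictionary: dict) -> str:
    """Render the mapping so the LLM can learn the encoding in-context."""
    lines = ["The user text is compressed. Expand meta-tokens using this dictionary:"]
    lines += [f"{tok} = {pat!r}" for tok, pat in dictionary.items()]
    return "\n".join(lines)
```

Losslessness here means the original text is recoverable by substituting each meta-token back with its dictionary entry; the paper's validation checks exactly this reconstruction, performed by the LLM itself.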

If this is right

  • Token counts and API costs drop substantially for repetitive data while output equivalence is retained.
  • No fine-tuning or training is required, allowing immediate use with existing API-based LLMs.
  • Datasets exceeding normal token limits become analyzable without truncation or summarization.
  • Compression quality remains stable across varying ratios, depending instead on inherent data repetition.
  • The method adapts automatically as repetitive patterns shift in evolving datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to structured text domains such as code or configuration files that contain repeated blocks.
  • Combining dictionary encoding with other token-reduction techniques could produce additive efficiency gains.
  • Real-time streaming applications like log monitoring could apply ongoing dictionary updates for sustained savings.
  • Dictionary size limits imposed by context windows may require separate optimization for very large vocabularies.

Load-bearing premise

Supplying the compression dictionary in the system prompt is sufficient for the LLM to interpret meta-tokens correctly during analysis without misreading or hallucinating the mapping.

What would settle it

An instance where analysis output on the dictionary-provided compressed prompt differs from output on the matching uncompressed prompt.
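Settling it means scoring output pairs, and the review's metrics are standard. A minimal sketch of the two proxy scores cited above, exact match and normalized Levenshtein similarity (function names ours):

```python
# Proxy metrics for comparing decompressed output against the original text:
# exact match and Levenshtein similarity normalized to [0, 1].

def levenshtein_similarity(a: str, b: str) -> float:
    """1 - edit_distance / max_len; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    # classic two-row dynamic program for edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))

def exact_match(a: str, b: str) -> float:
    return float(a == b)
```

A single (compressed, uncompressed) prompt pair whose analysis outputs score well below 1.0 on these metrics would be the counterexample described above.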

Figures

Figures reproduced from arXiv: 2604.13066 by Andresa Rodrigues de Campos, David Lee, Imry Kissos, Piyush Paritosh.

Figure 1. System overview showing the hierarchical compression algorithm (right), subsequence-finding detail (top left), and compressed output with dictionary (bottom left). view at source ↗
Figure 2. Compression ratios vs. Lmax without dictionary overhead (right) and with dictionary overhead (left). Most datasets show plateau behavior, while some exhibit optimal Lmax values beyond which dictionary overhead reduces compression efficiency. view at source ↗
Figure 3. Template-based decompression results comparing Claude 3.7 Sonnet and Nova Premier across LogHub 2k datasets. Each point is the average metric score over 2,000 logs per dataset. view at source ↗
Figure 4. Compression ratios and decompression quality as a function of Lmax. Top left: compression ratios by dataset; remaining panels: Levenshtein, ROUGE, and BLEU scores. Most datasets maintain scores above 0.9 regardless of compression level; outliers (HPC, Thunderbird) score consistently lower across all Lmax values, indicating dataset-specific rather than compression-related effects. view at source ↗
Figure 5. Relationship between compression ratio and decompression quality. Each point represents one (dataset, Lmax) configuration. Dashed lines show linear regression fits; shaded regions indicate 95% confidence intervals. R² values below 0.02 for all metrics demonstrate that compression intensity does not predict reconstruction quality; performance variation is driven by dataset characteristics. view at source ↗
Figure 6. Levenshtein similarity, ROUGE, and BLEU scores for algorithmic compression on the Apache dataset across Lmax values. view at source ↗
Figure 7. Levenshtein similarity, ROUGE, and BLEU scores for algorithmic compression on the BGL dataset across Lmax values. view at source ↗
Figure 8. Levenshtein similarity, ROUGE, and BLEU scores for algorithmic compression on the HDFS dataset across Lmax values. view at source ↗
Figure 9. Levenshtein similarity, ROUGE, and BLEU scores for algorithmic compression on the HPC dataset across Lmax values. view at source ↗
Figure 10. Levenshtein similarity, ROUGE, and BLEU scores for algorithmic compression on the Hadoop dataset across Lmax values. view at source ↗
Figure 11. Levenshtein similarity, ROUGE, and BLEU scores for algorithmic compression on the HealthApp dataset across Lmax values. view at source ↗
Figure 12. Levenshtein similarity, ROUGE, and BLEU scores for algorithmic compression on the Linux dataset across Lmax values. view at source ↗
Figure 13. Levenshtein similarity, ROUGE, and BLEU scores for algorithmic compression on the Mac dataset across Lmax values. view at source ↗
Figure 14. Levenshtein similarity, ROUGE, and BLEU scores for algorithmic compression on the OpenSSH dataset across Lmax values. view at source ↗
Figure 15. Levenshtein similarity, ROUGE, and BLEU scores for algorithmic compression on the OpenStack dataset across Lmax values. view at source ↗
Figure 16. Levenshtein similarity, ROUGE, and BLEU scores for algorithmic compression on the Proxifier dataset across Lmax values. view at source ↗
Figure 17. Levenshtein similarity, ROUGE, and BLEU scores for algorithmic compression on the Spark dataset across Lmax values. view at source ↗
Figure 18. Levenshtein similarity, ROUGE, and BLEU scores for algorithmic compression on the Thunderbird dataset across Lmax values. view at source ↗
Figure 19. Levenshtein similarity, ROUGE, and BLEU scores for algorithmic compression on the Zookeeper dataset across Lmax values. view at source ↗
read the original abstract

In-context learning has established itself as an important learning paradigm for Large Language Models (LLMs). In this paper, we demonstrate that LLMs can learn encoding keys in-context and perform analysis directly on encoded representations. This finding enables lossless prompt compression via dictionary encoding without model fine-tuning: frequently occurring subsequences are replaced with compact meta-tokens, and when provided with the compression dictionary in the system prompt, LLMs correctly interpret these meta-tokens during analysis, producing outputs equivalent to those from uncompressed inputs. We present a compression algorithm that identifies repetitive patterns at multiple length scales, incorporating a token-savings optimization criterion that ensures compression reduces costs by preventing dictionary overhead from exceeding savings. The algorithm achieves compression ratios up to 80% depending on dataset characteristics. To validate that LLM analytical accuracy is preserved under compression, we use decompression as a proxy task with unambiguous ground truth. Evaluation on the LogHub 2.0 benchmark using Claude 3.7 Sonnet demonstrates exact match rates exceeding 0.99 for template-based compression and average Levenshtein similarity scores above 0.91 for algorithmic compression, even at compression ratios of 60%-80%. Additionally, compression ratio explains less than 2% of variance in similarity metrics, indicating that decompression quality depends on dataset characteristics rather than compression intensity. This training-free approach works with API-based LLMs, directly addressing fundamental deployment constraints -- token limits and API costs -- and enabling cost-effective analysis of large-scale repetitive datasets, even as data patterns evolve over time.
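The "less than 2% of variance" figure is an R² from regressing similarity on compression ratio; for a simple linear fit this equals the squared Pearson correlation. A minimal sketch of the computation (function name ours):

```python
# R^2 of a simple linear regression of y (similarity score) on
# x (compression ratio), computed as the squared Pearson correlation.

def r_squared(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable explains no variance
    return (sxy * sxy) / (sxx * syy)
```

An R² below 0.02 on (compression ratio, similarity) pairs, as Figure 5 reports, is the quantitative form of "compression intensity does not predict reconstruction quality."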

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that LLMs can learn encoding keys in-context from a system prompt containing a compression dictionary and perform analysis directly on dictionary-encoded meta-token representations, producing outputs equivalent to those from uncompressed inputs. It introduces a multi-scale dictionary compression algorithm with a token-savings optimization criterion that achieves up to 80% compression on repetitive data. Validation uses a decompression proxy task on LogHub 2.0 with Claude 3.7 Sonnet, reporting exact match >0.99 for template-based compression and Levenshtein similarity >0.91 for algorithmic compression, with compression ratio explaining <2% of variance in similarity metrics.

Significance. If the central claim holds, the work would provide a training-free method for substantial token and cost reduction in LLM analysis of repetitive datasets such as logs, directly addressing API limits and expenses while allowing patterns to evolve. The proxy metrics are quantitatively strong and the validation is non-circular, but the significance is limited by the gap between the claimed direct analytical use of encoded representations and the decompression-based evidence provided.

major comments (2)
  1. [Evaluation] The validation relies exclusively on a decompression proxy task (exact match and Levenshtein metrics) rather than direct analytical tasks performed on the encoded meta-token inputs. This tests token-to-text reconstruction but does not establish whether the LLM executes the intended reasoning steps over the compressed representation itself, which is required to support the claim of cost-effective analysis on encoded data. See abstract paragraph on validation and the evaluation description.
  2. [Methods] The compression algorithm is described at a high level (multi-scale pattern identification and token-savings optimization to avoid dictionary overhead exceeding savings) but lacks the precise pseudocode, decision criteria for subsequence selection, and handling of edge cases needed for reproducibility. Without these details it is difficult to verify that the reported 60-80% ratios are achieved under the stated constraints.
minor comments (2)
  1. [Abstract] Clarify the exact model version (Claude 3.7 Sonnet appears to be a typographical error for Claude 3.5 Sonnet or the current release).
  2. [Results] Ensure all figures showing compression ratios versus similarity metrics include error bars or statistical controls and are referenced in the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights opportunities to strengthen the empirical support and reproducibility of our work. We address each major comment below and will incorporate revisions to enhance the manuscript accordingly.

read point-by-point responses
  1. Referee: [Evaluation] The validation relies exclusively on a decompression proxy task (exact match and Levenshtein metrics) rather than direct analytical tasks performed on the encoded meta-token inputs. This tests token-to-text reconstruction but does not establish whether the LLM executes the intended reasoning steps over the compressed representation itself, which is required to support the claim of cost-effective analysis on encoded data. See abstract paragraph on validation and the evaluation description.

    Authors: We appreciate this distinction. The decompression proxy was selected to provide unambiguous ground truth while testing whether the LLM correctly interprets meta-tokens from the in-context dictionary; the reported exact match rates exceeding 0.99 and Levenshtein similarities above 0.91 indicate faithful reconstruction, which is a necessary precondition for any downstream reasoning on the compressed form. Nevertheless, to more directly demonstrate analytical capabilities, we will revise the evaluation section to include direct analytical tasks (such as anomaly detection and summarization) performed on encoded inputs, with outputs compared against uncompressed baselines. These additions will be detailed in the revised manuscript. revision: yes

  2. Referee: [Methods] The compression algorithm is described at a high level (multi-scale pattern identification and token-savings optimization to avoid dictionary overhead exceeding savings) but lacks the precise pseudocode, decision criteria for subsequence selection, and handling of edge cases needed for reproducibility. Without these details it is difficult to verify that the reported 60-80% ratios are achieved under the stated constraints.

    Authors: We agree that greater specificity is required for reproducibility. In the revised manuscript, we will augment the Methods section with full pseudocode for the multi-scale algorithm, explicit decision criteria for subsequence selection (including frequency thresholds, length scales, and the token-savings optimization formula), and discussion of edge-case handling such as overlapping patterns, minimum occurrence requirements, and dictionary size constraints. This will enable verification of the reported compression ratios. revision: yes
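The criterion the authors promise to spell out can be stated compactly. A hypothetical rendering (the names and the accounting of overhead are ours; the paper may define them differently):

```python
# Hypothetical formalization of the token-savings criterion: substitute a
# pattern only when the total savings across all occurrences exceed the
# token cost of carrying its dictionary entry in the system prompt.

def net_token_savings(pattern_tokens: int, meta_tokens: int,
                      occurrences: int, entry_overhead_tokens: int) -> int:
    """Tokens saved overall: per-use gain times uses, minus dictionary cost."""
    per_use = pattern_tokens - meta_tokens
    return occurrences * per_use - entry_overhead_tokens

def worth_compressing(pattern_tokens: int, meta_tokens: int,
                      occurrences: int, entry_overhead_tokens: int) -> bool:
    return net_token_savings(pattern_tokens, meta_tokens,
                             occurrences, entry_overhead_tokens) > 0
```

Under this rule a rare pattern fails the test even when each substitution saves tokens, which is exactly how dictionary overhead is prevented from exceeding savings.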

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's central demonstration relies on an empirical compression algorithm that identifies repetitive patterns and a separate validation using decompression as an independent proxy task with unambiguous ground truth (exact match and Levenshtein metrics). No equations, fitted parameters, or results reduce by construction to the inputs or to self-citations; the token-savings criterion and in-context interpretation are defined externally to the analytical equivalence claim. The derivation remains self-contained against external benchmarks rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that current LLMs possess sufficient in-context learning capacity to map and apply meta-tokens correctly when a dictionary is supplied; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption LLMs can learn and apply encoding keys provided in the system prompt during analysis tasks
    This is the load-bearing premise stated in the abstract for the lossless compression claim.

pith-pipeline@v0.9.0 · 5599 in / 1183 out tokens · 22505 ms · 2026-05-15T07:53:54.575302+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

  1. [1]

    Zhu, X. et al. Loghub: A large collection of system log datasets towards automated log analytics. EMSE, 28:54. URL: https://arxiv.org/abs/2403.18119

  2. [2]

    doi: 10.1186/s40411-023-00194-9. URL: https://doi.org/10.1186/s40411-023-00194-9
    doi: 10.1186/s40411-023-00194-9. URL https: //doi.org/10.1186/s40411-023-00194-9 . A D ATASET TOKEN STATISTICS Table 3 presents the original token counts for each dataset in the LogHub-2k experiments, computed using the Claude tokenizer. B T EMPLATE -BASED DECOMPRESSION DETAILED METRICS Table 4 presents per-dataset metrics for template-based de- compressi...