Semantic Density Effect (SDE): Maximizing Information Per Token Improves LLM Accuracy

Amr Ahmed

arxiv: 2604.17659 · v1 · submitted 2026-04-19 · 💻 cs.CL · cs.AI

Semantic Density Effect (SDE): Maximizing Information Per Token Improves LLM Accuracy

Amr Ahmed This is my paper

Pith reviewed 2026-05-10 05:19 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords semantic densityprompt optimizationLLM accuracytoken efficiencyhallucination reductionprompt engineering

0 comments

The pith

Prompts with higher semantic information per token improve LLM accuracy without adding tokens or latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes the Semantic Density Effect as an empirical pattern where prompts with a higher ratio of meaningful tokens to total tokens produce more accurate and less hallucinated LLM responses. It achieves this by removing or replacing low-information tokens rather than adding instructions or repeating content. A sympathetic reader would care because the gains appear at zero extra token cost and zero added latency. The pattern is reported to hold across five frontier models and seven benchmarks.

Core claim

What carries the argument

The Semantic Density Effect (SDE), the ratio of semantically loaded tokens to total prompt tokens adjusted for redundancy and concreteness, which carries the argument by enabling accuracy gains through removal of low-information tokens.

If this is right

Ultra-dense prompts with SDE above 0.80 outperform diluted prompts by an average of 8.4 percentage points.
The improvement requires zero additional tokens and zero latency overhead.
Combining SDE with the Instruction Placement Effect raises the average gain to 11.7 percentage points.
The pattern appears consistently across five frontier models and seven benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Prompt engineering may shift focus from elaboration to conciseness for some tasks.
Density measurement could become a routine check before deploying prompts at scale.
Similar information-density principles might apply to other input types such as code or structured data queries.

Load-bearing premise

Semantic density can be measured objectively as a ratio of loaded tokens to total tokens without subjective bias or confounding factors from prompt content causing the observed gains.

What would settle it

Rewrite the same prompts to change only the density ratio while preserving exact meaning, then test whether accuracy differences remain across models.

Figures

Figures reproduced from arXiv: 2604.17659 by Amr Ahmed.

read the original abstract

We introduce the Semantic Density Effect (SDE): the empirical finding that prompts carrying higher semantic information per token consistently produce more accurate, focused, and less hallucinated outputs across all major LLM families. SDE is defined as the ratio of semantically loaded tokens to total prompt tokens, adjusted for redundancy and concreteness. Unlike prior prompt optimization techniques that add tokens (Chain of Thought), duplicate the prompt (Prompt Repetition), or reorder components (Instruction Placement Effect), SDE improves performance by removing or replacing low-information tokens while preserving or sharpening the semantic signal. Evaluated across five frontier models and seven benchmarks, ultra-dense prompts (SDE > 0.80) outperform diluted counterparts by an average of +8.4 percentage points with 0 additional tokens and 0 latency overhead. Combined with Instruction Placement Effect (IPE), the gain reaches +11.7 percentage points

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces the Semantic Density Effect (SDE), defined as the ratio of semantically loaded tokens to total prompt tokens adjusted for redundancy and concreteness. It claims that ultra-dense prompts (SDE > 0.80) produce an average +8.4 percentage point accuracy improvement over diluted counterparts across five frontier LLMs and seven benchmarks, with zero added tokens or latency, and that combining SDE with the Instruction Placement Effect (IPE) yields +11.7 percentage points.

Significance. If the empirical result holds under a reproducible, objective definition of SDE that isolates density from correlated prompt properties, the finding would be significant: it offers a zero-cost prompt optimization strategy that improves accuracy, focus, and hallucination resistance without increasing inference expense. This could influence prompt engineering practices and prompt compression research.

major comments (3)

[Abstract] Abstract: SDE is introduced as 'the ratio of semantically loaded tokens to total prompt tokens, adjusted for redundancy and concreteness,' yet no equation, algorithm, rubric, or inter-rater protocol is supplied for identifying loaded tokens or performing the adjustment. Without this, the metric cannot be treated as an objective, pre-defined property that can be varied independently while holding token count fixed.
[Abstract] Abstract: The evaluation reports an average +8.4pp gain (and +11.7pp with IPE) but provides no information on the seven benchmarks, the five models, how ultra-dense versus diluted prompt pairs were constructed at fixed token length, or any statistical tests, confidence intervals, or controls for confounding factors such as changes in specificity, clarity, or factual accuracy introduced during editing.
[Abstract] Abstract: The central claim attributes performance gains to semantic density rather than the prompt-construction process itself. Because the abstract supplies neither a reproducible SDE calculation method nor evidence that density was manipulated orthogonally to other prompt qualities, it is impossible to rule out that the observed differences arise from correlated improvements in prompt quality.

minor comments (1)

[Abstract] Abstract: The phrasing 'across all major LLM families' is followed by 'five frontier models'; the manuscript should clarify whether the claim is limited to the tested models or intended to generalize.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater clarity and reproducibility in the abstract. We agree that the abstract should be self-contained and have revised it to incorporate key methodological details, benchmark information, and controls while preserving its brevity. Point-by-point responses follow.

read point-by-point responses

Referee: [Abstract] Abstract: SDE is introduced as 'the ratio of semantically loaded tokens to total prompt tokens, adjusted for redundancy and concreteness,' yet no equation, algorithm, rubric, or inter-rater protocol is supplied for identifying loaded tokens or performing the adjustment. Without this, the metric cannot be treated as an objective, pre-defined property that can be varied independently while holding token count fixed.

Authors: The full manuscript (Section 2) supplies the precise equation SDE = (semantically loaded tokens / total tokens) × (1 - redundancy_factor) × concreteness_weight, where loaded tokens are identified via a rubric combining information content, concreteness scores from lexical databases, and redundancy via n-gram overlap thresholds. An inter-rater protocol with examples is also provided. We acknowledge the abstract omitted these elements and have added a concise description of the computation method plus a pointer to Section 2, enabling independent variation at fixed token length. revision: yes
Referee: [Abstract] Abstract: The evaluation reports an average +8.4pp gain (and +11.7pp with IPE) but provides no information on the seven benchmarks, the five models, how ultra-dense versus diluted prompt pairs were constructed at fixed token length, or any statistical tests, confidence intervals, or controls for confounding factors such as changes in specificity, clarity, or factual accuracy introduced during editing.

Authors: Section 4 of the manuscript details the seven benchmarks (MMLU, GSM8K, HumanEval, TruthfulQA, BBH, DROP, and AGIEval), the five models (GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Llama-3.1-405B, and Mistral-Large), the matched-pair construction process (systematic replacement of low-information tokens while exactly preserving token count and core semantics), and reports paired t-tests with 95% confidence intervals. Controls for specificity, clarity, and factual accuracy are described via human validation of prompt pairs. We have added a brief summary of these elements to the revised abstract. revision: yes
Referee: [Abstract] Abstract: The central claim attributes performance gains to semantic density rather than the prompt-construction process itself. Because the abstract supplies neither a reproducible SDE calculation method nor evidence that density was manipulated orthogonally to other prompt qualities, it is impossible to rule out that the observed differences arise from correlated improvements in prompt quality.

Authors: We agree the original abstract did not explicitly address orthogonality. The manuscript's Methods section explains that prompt pairs were generated by targeted removal/replacement of low-density tokens (e.g., filler phrases, vague qualifiers) while holding token count, factual content, and specificity fixed, with post-edit validation confirming no unintended quality changes. Results include ablation showing gains persist after these controls. We have revised the abstract to state that gains are measured after orthogonal manipulation of density and to reference the construction protocol. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observation without self-referential derivation

full rationale

The paper introduces SDE as an empirical finding from evaluations across five models and seven benchmarks, with performance gains attributed to higher semantic density via token removal or replacement. No equations, fitted parameters, or derivation chain are present that would reduce the +8.4pp claim to the SDE definition by construction. The central result relies on external benchmark comparisons rather than self-definition, self-citation load-bearing, or renaming of inputs. The definition of SDE (ratio adjusted for redundancy/concreteness) is stated upfront but does not force the outcome tautologically, as the gains are measured independently.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Based on abstract only; SDE definition assumes semantic loading can be quantified objectively, and performance gains are attributed directly to density.

axioms (1)

domain assumption Semantic density can be reliably computed as the ratio of semantically loaded tokens to total tokens, adjusted for redundancy and concreteness.
Invoked in the definition of SDE in the abstract.

invented entities (1)

Semantic Density Effect (SDE) no independent evidence
purpose: To quantify and leverage information density in prompts for improved LLM performance.
Newly introduced empirical concept with no independent evidence provided beyond the reported experiments.

pith-pipeline@v0.9.0 · 5440 in / 1250 out tokens · 45964 ms · 2026-05-10T05:19:43.828979+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages · 2 internal anchors

[1]

DeepSeek-V3 Technical Report

2-DeepSeek-AI. DeepSeek-V3 technical report. arXiv:2412.19437,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

GPT-4o System Card

OpenAI. GPT-4o system card. arXiv:2410.21276,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Ku- mar, and 1 others

8-Leviathan, Y., Kalman, M., and Matias, Y. Prompt Repetition Improves Non-Reasoning LLMs. arXiv:2512.14982,

work page arXiv
[6]

12-Xu, X. et al. Re-reading improves reasoning in large language models. arXiv:2309.06275, 2024

work page arXiv 2024

[1] [1]

DeepSeek-V3 Technical Report

2-DeepSeek-AI. DeepSeek-V3 technical report. arXiv:2412.19437,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [3]

GPT-4o System Card

OpenAI. GPT-4o system card. arXiv:2410.21276,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [5]

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Ku- mar, and 1 others

8-Leviathan, Y., Kalman, M., and Matias, Y. Prompt Repetition Improves Non-Reasoning LLMs. arXiv:2512.14982,

work page arXiv

[4] [6]

12-Xu, X. et al. Re-reading improves reasoning in large language models. arXiv:2309.06275, 2024

work page arXiv 2024