pith. sign in

arxiv: 2601.14252 · v6 · submitted 2026-01-20 · 💻 cs.IT · cs.PL· math.IT

Semantic Identity Compression: Zero-Error Laws, Rate-Distortion, and Neurosymbolic Necessity

Pith reviewed 2026-05-16 12:07 UTC · model grok-4.3

classification 💻 cs.IT cs.PLmath.IT
keywords semantic compressionidentity recoverycollision fiberszero-error codingrate-distortion theoryneurosymbolic systemssymbolic mechanisms
0
0 comments X

The pith

Any non-injective semantic representation requires symbolic identity mechanisms to achieve zero-error recovery of distinct entities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Neural embeddings and other semantic representations compress information by mapping multiple distinct objects to the same value, creating collision ambiguity that prevents exact identity recovery. The paper derives exact zero-error laws showing that the minimum extra cost to resolve this ambiguity is governed entirely by the collision-fiber geometry of the mapping, including a tight bound on code length and a fiberwise rate-distortion function. These laws apply to arbitrary finite sources through a recoverable-mass decomposition across fibers. A sympathetic reader cares because the results establish a structural necessity for symbolic complements such as handles, keys, or pointers in any system that uses lossy semantic compression. All derivations are machine-checked.

Core claim

The collision-fiber geometry of a representation map π, with A_π equal to the size of the largest fiber, controls all residual costs of exact identity recovery. This yields a fixed-length converse L ≥ log₂ A_π, an exact finite-block scaling law, a pointwise adaptive budget ⌈log₂ |π^{-1}(u)|⌉, and a fiberwise rate-distortion law for arbitrary finite sources. When all mass lies on one block the uniform formula simplifies to D*(L) = max(0, 1 - 2^L / a). Because the ambiguity is structural, symbolic identity mechanisms become the necessary system-level complement to any non-injective semantic representation.

What carries the argument

The collision-fiber geometry of the representation map π, whose preimage sizes |π^{-1}(u)| determine the exact coding bounds and rate-distortion expressions for identity recovery.

If this is right

  • Fixed-length identity codes must use at least log₂ A_π bits.
  • Pointwise budgets allocate exactly ⌈log₂ |π^{-1}(u)|⌉ bits per fiber.
  • The rate-distortion function takes the closed form D*(L) = max(0, 1 - 2^L / a) when mass concentrates on one block.
  • Query complexity for distinguishing families follows the same fiber geometry.
  • All laws extend to arbitrary finite sources via recoverable-mass decomposition across fibers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid AI systems must maintain separate symbolic layers for identity tracking instead of attempting to encode it inside continuous embeddings.
  • The same geometry supplies a diagnostic for when existing embedding models will systematically fail at entity disambiguation tasks.
  • Database and memory systems already embody these laws through the use of explicit keys and pointers precisely because their indexes are lossy.
  • The framework suggests a natural next step: approximate fiber-size estimation for continuous or high-dimensional sources to predict identity-recovery costs.

Load-bearing premise

The collision-fiber sizes of the representation map capture all residual ambiguity that must be resolved for exact identity recovery in any finite source.

What would settle it

A concrete semantic representation and finite source where the measured bits or queries needed for zero-error identity recovery deviate from the formulas based on its observed fiber sizes.

read the original abstract

Symbolic systems operate over precise identities: variables denote specific objects, pointers target precise memory locations, and database keys refer to singular records. Neural embeddings generalize by compressing away semantic detail, but this compression creates collision ambiguity: multiple distinct entities can share the same representation value. Exact identity recovery requires additional information precisely when representation fibers have size greater than one. The residual cost is controlled by a single combinatorial object: the collision-fiber geometry of the representation map $\pi$. Let $A_{\pi}=\max_u |\pi^{-1}(u)|$ be the largest collision fiber. The finite laws include a tight fixed-length converse $L \ge \log_2 A_{\pi}$, an exact finite-block scaling law, a pointwise adaptive budget $\lceil \log_2 |\pi^{-1}(u)|\rceil$, and an exact fiberwise rate-distortion law for arbitrary finite sources via recoverable-mass decomposition across representation fibers. The uniform single-block formula $D^\star(L)=\max(0,1-2^L/a)$ appears as a closed-form special case when all mass lies on one collision block, where $a = A_{\pi}$ is the collision block size. The same fiber geometry determines query complexity and canonical structure for distinguishing families. Because this residual ambiguity is structural rather than representation-specific, symbolic identity mechanisms (handles, keys, pointers, nominal tags) are the necessary system-level complement to any non-injective semantic representation. All main results are machine-checked in Lean 4.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript derives a family of zero-error compression laws and rate-distortion bounds for semantic identity recovery from a representation map π, controlled by the collision-fiber geometry (A_π = max_u |π^{-1}(u)|). It establishes a tight fixed-length converse L ≥ log₂ A_π, exact finite-block scaling, a pointwise adaptive budget, and a fiberwise rate-distortion law via recoverable-mass decomposition for arbitrary finite sources. The central claim is that symbolic identity mechanisms are the necessary system-level complement to any non-injective semantic representation; all main results are machine-checked in Lean 4.

Significance. If the derivations hold, the work supplies a parameter-free, combinatorial foundation for the structural necessity of symbolic components in neurosymbolic systems. The machine-checked proofs, closed-form expressions such as D^*(L) = max(0, 1 - 2^L/a), and direct applicability to arbitrary finite sources constitute clear strengths that enable falsifiable predictions for representation design.

major comments (1)
  1. [Abstract and §3 (recoverable-mass decomposition)] The necessity corollary rests on the assumption that collision-fiber geometry fully captures residual identity ambiguity for arbitrary finite sources. While the zero-error converse L ≥ log₂ A_π follows directly from the definition of A_π, the manuscript does not explicitly rule out additional sources of ambiguity (e.g., context-dependent or non-combinatorial) that could arise outside the fiber decomposition; this assumption is load-bearing for the system-level claim.
minor comments (2)
  1. [Abstract] The uniform single-block formula D^*(L) = max(0, 1 - 2^L/a) is presented as a special case; an explicit statement of the conditions under which all mass lies on one collision block would improve clarity.
  2. [§4] Notation for the pointwise adaptive budget ⌈log₂ |π^{-1}(u)|⌉ should be cross-referenced to the definition of the fiberwise rate-distortion law to avoid ambiguity in the finite-block scaling section.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for identifying the need to clarify the scope of the necessity corollary. We address the comment below and will make a targeted revision to improve precision.

read point-by-point responses
  1. Referee: [Abstract and §3 (recoverable-mass decomposition)] The necessity corollary rests on the assumption that collision-fiber geometry fully captures residual identity ambiguity for arbitrary finite sources. While the zero-error converse L ≥ log₂ A_π follows directly from the definition of A_π, the manuscript does not explicitly rule out additional sources of ambiguity (e.g., context-dependent or non-combinatorial) that could arise outside the fiber decomposition; this assumption is load-bearing for the system-level claim.

    Authors: We agree that an explicit scoping statement strengthens the presentation. Within the paper's model—arbitrary finite sources X equipped with a fixed representation map π: X → U—the collision fibers π^{-1}(u) exhaust the identity ambiguity by definition: given only the representation value u, the possible preimages are precisely the fiber elements. All derived laws (zero-error converse, finite-block scaling, pointwise budget, and fiberwise rate-distortion via recoverable-mass decomposition) follow directly from this geometry. Context-dependent or non-combinatorial ambiguities would require an extended model (e.g., side information channels or non-finite alphabets) outside the present finite-source combinatorial framework. To make the boundary explicit, we will revise the abstract and the opening of §3 to state: 'for arbitrary finite sources under a representation map π, where residual identity ambiguity is fully captured by the collision-fiber geometry.' This is a minor clarification that does not alter any theorem or proof. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained via formal verification

full rationale

The paper defines the collision-fiber size A_π = max_u |π^{-1}(u)| as the sole combinatorial object controlling residual ambiguity, then derives the zero-error converse L ≥ log₂ A_π, finite-block scaling, pointwise adaptive budget, and fiberwise rate-distortion law directly from recoverable-mass decomposition over arbitrary finite sources and representation maps π. All central results are machine-checked in Lean 4, providing external formal verification independent of fitted parameters or self-referential equations. The necessity of symbolic mechanisms follows as a corollary once fiber geometry is accepted as the complete description of identity ambiguity; no load-bearing step reduces the claimed laws to their inputs by construction, and no self-citation chain is invoked for the uniqueness or derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The framework rests on standard information-theoretic and combinatorial axioms plus the introduced collision-fiber concept; no free parameters are fitted to data.

axioms (1)
  • standard math Standard axioms of finite information theory and combinatorics for rate-distortion and fiber decompositions
    Invoked for the scaling laws, converse bounds, and recoverable-mass decomposition across fibers.
invented entities (1)
  • collision-fiber geometry no independent evidence
    purpose: Quantifies structural residual ambiguity in non-injective representations
    Defined via A_π = max fiber size and used as the controlling object for all laws and necessity claims.

pith-pipeline@v0.9.0 · 5573 in / 1173 out tokens · 50321 ms · 2026-05-16T12:07:33.284429+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 1 internal anchor

  1. [1]

    A theoretical analysis of contrastive unsupervised representation learning,

    N. Saunshi, O. Plevrakis, S. Arora, M. Khodak, and N. Khandeparkar, “A theoretical analysis of contrastive unsupervised representation learning,” inProceedings of the 36th International Conference on Machine Learning (ICML), 2019, pp. 5628–5637. [Online]. Available: http://proceedings.mlr.press/v97/saunshi19a.html

  2. [2]

    β-vae: Learning basic visual concepts with a constrained variational framework,

    I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, A. Lerchner, Z. Zhenget al., “β-vae: Learning basic visual concepts with a constrained variational framework,” inInternational Conference on Learning Representations (ICLR), 2017. [Online]. Available: https://openreview.net/forum?id=Hkx9Qkcg

  3. [3]

    Deep variational information bottleneck,

    A. Alemi, A. Fischer, V . Soro, and P. Sontag, “Deep variational information bottleneck,” inInternational Conference on Learning Representations (ICLR), 2017. [Online]. Available: https://openreview. net/forum?id=H1g1Ylxl

  4. [4]

    Understanding dimensional collapse in contrastive self-supervised learning,

    L. Jing, P. Vincent, Y . LeCun, and Y . Tian, “Understanding dimensional collapse in contrastive self-supervised learning,” inInternational Conference on Learning Representations (ICLR), 2022. [Online]. Available: https://openreview.net/forum?id=YeuYXfR2iOa 79 :NSL1-6

  5. [5]

    Understanding contrastive representation learning through alignment and uniformity on the hypersphere,

    T. Wang and P. Isola, “Understanding contrastive representation learning through alignment and uniformity on the hypersphere,” in International Conference on Machine Learning (ICML). PMLR, 2020, pp. 9929–9939. [Online]. Available: http://proceedings.mlr.press/v119/ wang20k.html

  6. [6]

    Abduction-based explanations for machine learning models,

    A. Ignatiev, N. Narodytska, and J. Marques-Silva, “Abduction-based explanations for machine learning models,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 1, 2019, pp. 1511– 1519

  7. [7]

    Delivering trustworthy AI through formal XAI,

    J. Marques-Silva and A. Ignatiev, “Delivering trustworthy AI through formal XAI,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 11, 2022, pp. 12 342–12 350

  8. [8]

    Differential privacy,

    C. Dwork, “Differential privacy,” inInternational Colloquium on Au- tomata, Languages, and Programming (ICALP). Springer, 2006, pp. 1–12

  9. [9]

    Noiseless coding of correlated information sources,

    D. Slepian and J. Wolf, “Noiseless coding of correlated information sources,”IEEE Transactions on Information Theory, vol. 19, no. 4, pp. 471–480, 1973

  10. [10]

    The rate-distortion function for source coding with side information at the decoder,

    A. D. Wyner and J. Ziv, “The rate-distortion function for source coding with side information at the decoder,”IEEE Transactions on Information Theory, vol. 22, no. 1, pp. 1–10, 1976

  11. [11]

    Source coding theory for cascade and branching com- munication systems,

    H. Yamamoto, “Source coding theory for cascade and branching com- munication systems,”IEEE Transactions on Information Theory, vol. 27, no. 3, pp. 299–308, 1981

  12. [12]

    Gaussian multiterminal source coding,

    Y . Oohama, “Gaussian multiterminal source coding,”IEEE Transactions on Information Theory, vol. 43, no. 6, pp. 1912–1923, 1997

  13. [13]

    The Wyner–Ziv problem with coded side information,

    M. Gastpar, “The Wyner–Ziv problem with coded side information,” IEEE Transactions on Information Theory, vol. 50, no. 11, pp. 2762– 2768, 2004

  14. [14]

    The zero error capacity of a noisy channel,

    C. E. Shannon, “The zero error capacity of a noisy channel,”IRE Transactions on Information Theory, vol. 2, no. 3, pp. 8–19, 1956

  15. [15]

    On the shannon capacity of a graph,

    L. Lovász, “On the shannon capacity of a graph,”IEEE Transactions on Information Theory, vol. 25, no. 1, pp. 1–7, 1979

  16. [16]

    uz, Z. Qin, I. Estella Aguerri, H. S. Dhillon, Z. Yang, A. Yener, K.-K. Wong, and C.-B. Chae, “Beyond transmitting bits: Context, semantics, and task-oriented communications,

    D. Gund"uz, Z. Qin, I. Estella Aguerri, H. S. Dhillon, Z. Yang, A. Yener, K.-K. Wong, and C.-B. Chae, “Beyond transmitting bits: Context, semantics, and task-oriented communications,”IEEE Journal on Selected Areas in Communications, vol. 41, no. 1, pp. 5–41, 2023

  17. [17]

    Semantic communications in networked systems: A data significance perspective,

    E. Uysal, O. Kaya, A. Ephremideset al., “Semantic communications in networked systems: A data significance perspective,”IEEE Network, vol. 36, no. 4, pp. 233–240, 2022

  18. [18]

    Towards a theory of semantic communication,

    J. Bao, P. Basu, M. Dean, C. Partridge, A. Swami, and W. Leland, “Towards a theory of semantic communication,” in2011 IEEE Network Science Workshop. IEEE, 2011, pp. 110–117

  19. [19]

    Semantics-empowered communication for networked intelligent systems,

    M. Kountouris and N. Pappas, “Semantics-empowered communication for networked intelligent systems,”IEEE Communications Magazine, vol. 59, no. 6, pp. 96–102, 2021

  20. [20]

    The information bottleneck method

    N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,”arXiv preprint physics/0004057, 2000

  21. [21]

    Rethinking lossy compression: The rate- distortion-perception tradeoff,

    Y . Blau and T. Michaeli, “Rethinking lossy compression: The rate- distortion-perception tradeoff,” inProceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 97, 2019, pp. 675–685. [Online]. Available: https://proceedings.mlr.press/v97/blau19a.html

  22. [22]

    On the rate-distortion-perception function,

    J. Chen, L. Yu, J. Wang, W. Shi, Y . Ge, and W. Tong, “On the rate-distortion-perception function,”IEEE Journal on Selected Areas in Information Theory, vol. 3, no. 4, pp. 664–673, 2022

  23. [23]

    Output-constrained lossy source coding with application to rate-distortion-perception theory,

    L. Xie, L. Li, J. Chen, and Z. Zhang, “Output-constrained lossy source coding with application to rate-distortion-perception theory,” IEEE Transactions on Communications, vol. 73, no. 3, pp. 1801–1815, 2025

  24. [24]

    Perfect graphs and graph entropy,

    G. Simonyi, “Perfect graphs and graph entropy,”Combinatorica, vol. 15, no. 4, pp. 541–549, 1995

  25. [25]

    Source coding and graph entropies,

    N. Alon and A. Orlitsky, “Source coding and graph entropies,”IEEE Transactions on Information Theory, vol. 42, no. 5, pp. 1329–1339, 1996

  26. [26]

    An upper bound for the shannon capacity of a graph,

    C. Haemers, “An upper bound for the shannon capacity of a graph,” Colloquium Mathematicum Societatis János Bolyai, vol. 25, pp. 267– 272, 1978

  27. [27]

    The asymptotic spectrum of graphs and the shannon capacity,

    J. Zuiddam, “The asymptotic spectrum of graphs and the shannon capacity,”Combinatorica, vol. 39, pp. 1173–1184, 2019

  28. [28]

    The zero-error side information problem and chromatic numbers,

    H. S. Witsenhausen, “The zero-error side information problem and chromatic numbers,”IEEE Transactions on Information Theory, vol. 22, no. 5, pp. 592–593, 1976

  29. [29]

    Channel coding rate in the finite blocklength regime,

    Y . Polyanskiy, H. V . Poor, and S. Verdú, “Channel coding rate in the finite blocklength regime,”IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2307–2359, 2010. 13

  30. [30]

    Csiszár and J

    I. Csiszár and J. Körner,Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed. Cambridge University Press, 2011

  31. [31]

    D. J. A. Welsh,Matroid Theory. Academic Press, 1976

  32. [32]

    T. M. Cover and J. A. Thomas,Elements of Information Theory, 2nd ed. Wiley-Interscience, 2006

  33. [33]

    The information- theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems,

    A. Barron, M. J. Schervish, and L. Wasserman, “The information- theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems,”Bayesian Statistics, vol. 6, pp. 27–52, 1998

  34. [34]

    Retrieval- augmented generation for knowledge-intensive nlp tasks,

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschelet al., “Retrieval- augmented generation for knowledge-intensive nlp tasks,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 9459–9474. [Online]. Available: https://proceedings.neurips.cc/paper/ 2020/hash/6b4...

  35. [35]

    Coding for computing,

    A. Orlitsky and J. R. Roche, “Coding for computing,”IEEE Transactions on Information Theory, vol. 47, no. 3, pp. 903–917, 2001

  36. [36]

    Functional compression through graph coloring,

    V . Doshi, D. Shah, M. Médard, and M. Effros, “Functional compression through graph coloring,”IEEE Transactions on Information Theory, vol. 56, no. 8, pp. 3901–3917, 2010

  37. [37]

    Identification via channels,

    R. Ahlswede and G. Dueck, “Identification via channels,”IEEE Trans- actions on Information Theory, vol. 35, no. 1, pp. 15–29, 1989

  38. [38]

    Coding of an information source having ambiguous alphabet and the entropy of graphs,

    J. Körner, “Coding of an information source having ambiguous alphabet and the entropy of graphs,”6th Prague Conference on Information Theory, pp. 411–425, 1973