Semantic Identity Compression: Zero-Error Laws, Rate-Distortion, and Neurosymbolic Necessity
Pith reviewed 2026-05-16 12:07 UTC · model grok-4.3
The pith
Any non-injective semantic representation requires symbolic identity mechanisms to achieve zero-error recovery of distinct entities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The collision-fiber geometry of a representation map π, with A_π equal to the size of the largest fiber, controls all residual costs of exact identity recovery. This yields a fixed-length converse L ≥ log₂ A_π, an exact finite-block scaling law, a pointwise adaptive budget ⌈log₂ |π^{-1}(u)|⌉, and a fiberwise rate-distortion law for arbitrary finite sources. When all mass lies on one block the uniform formula simplifies to D*(L) = max(0, 1 - 2^L / a). Because the ambiguity is structural, symbolic identity mechanisms become the necessary system-level complement to any non-injective semantic representation.
What carries the argument
The collision-fiber geometry of the representation map π, whose preimage sizes |π^{-1}(u)| determine the exact coding bounds and rate-distortion expressions for identity recovery.
If this is right
- Fixed-length identity codes must use at least log₂ A_π bits.
- Pointwise budgets allocate exactly ⌈log₂ |π^{-1}(u)|⌉ bits per fiber.
- The rate-distortion function takes the closed form D*(L) = max(0, 1 - 2^L / a) when mass concentrates on one block.
- Query complexity for distinguishing families follows the same fiber geometry.
- All laws extend to arbitrary finite sources via recoverable-mass decomposition across fibers.
Where Pith is reading between the lines
- Hybrid AI systems must maintain separate symbolic layers for identity tracking instead of attempting to encode it inside continuous embeddings.
- The same geometry supplies a diagnostic for when existing embedding models will systematically fail at entity disambiguation tasks.
- Database and memory systems already embody these laws through the use of explicit keys and pointers precisely because their indexes are lossy.
- The framework suggests a natural next step: approximate fiber-size estimation for continuous or high-dimensional sources to predict identity-recovery costs.
Load-bearing premise
The collision-fiber sizes of the representation map capture all residual ambiguity that must be resolved for exact identity recovery in any finite source.
What would settle it
A concrete semantic representation and finite source where the measured bits or queries needed for zero-error identity recovery deviate from the formulas based on its observed fiber sizes.
read the original abstract
Symbolic systems operate over precise identities: variables denote specific objects, pointers target precise memory locations, and database keys refer to singular records. Neural embeddings generalize by compressing away semantic detail, but this compression creates collision ambiguity: multiple distinct entities can share the same representation value. Exact identity recovery requires additional information precisely when representation fibers have size greater than one. The residual cost is controlled by a single combinatorial object: the collision-fiber geometry of the representation map $\pi$. Let $A_{\pi}=\max_u |\pi^{-1}(u)|$ be the largest collision fiber. The finite laws include a tight fixed-length converse $L \ge \log_2 A_{\pi}$, an exact finite-block scaling law, a pointwise adaptive budget $\lceil \log_2 |\pi^{-1}(u)|\rceil$, and an exact fiberwise rate-distortion law for arbitrary finite sources via recoverable-mass decomposition across representation fibers. The uniform single-block formula $D^\star(L)=\max(0,1-2^L/a)$ appears as a closed-form special case when all mass lies on one collision block, where $a = A_{\pi}$ is the collision block size. The same fiber geometry determines query complexity and canonical structure for distinguishing families. Because this residual ambiguity is structural rather than representation-specific, symbolic identity mechanisms (handles, keys, pointers, nominal tags) are the necessary system-level complement to any non-injective semantic representation. All main results are machine-checked in Lean 4.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript derives a family of zero-error compression laws and rate-distortion bounds for semantic identity recovery from a representation map π, controlled by the collision-fiber geometry (A_π = max_u |π^{-1}(u)|). It establishes a tight fixed-length converse L ≥ log₂ A_π, exact finite-block scaling, a pointwise adaptive budget, and a fiberwise rate-distortion law via recoverable-mass decomposition for arbitrary finite sources. The central claim is that symbolic identity mechanisms are the necessary system-level complement to any non-injective semantic representation; all main results are machine-checked in Lean 4.
Significance. If the derivations hold, the work supplies a parameter-free, combinatorial foundation for the structural necessity of symbolic components in neurosymbolic systems. The machine-checked proofs, closed-form expressions such as D^*(L) = max(0, 1 - 2^L/a), and direct applicability to arbitrary finite sources constitute clear strengths that enable falsifiable predictions for representation design.
major comments (1)
- [Abstract and §3 (recoverable-mass decomposition)] The necessity corollary rests on the assumption that collision-fiber geometry fully captures residual identity ambiguity for arbitrary finite sources. While the zero-error converse L ≥ log₂ A_π follows directly from the definition of A_π, the manuscript does not explicitly rule out additional sources of ambiguity (e.g., context-dependent or non-combinatorial) that could arise outside the fiber decomposition; this assumption is load-bearing for the system-level claim.
minor comments (2)
- [Abstract] The uniform single-block formula D^*(L) = max(0, 1 - 2^L/a) is presented as a special case; an explicit statement of the conditions under which all mass lies on one collision block would improve clarity.
- [§4] Notation for the pointwise adaptive budget ⌈log₂ |π^{-1}(u)|⌉ should be cross-referenced to the definition of the fiberwise rate-distortion law to avoid ambiguity in the finite-block scaling section.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying the need to clarify the scope of the necessity corollary. We address the comment below and will make a targeted revision to improve precision.
read point-by-point responses
-
Referee: [Abstract and §3 (recoverable-mass decomposition)] The necessity corollary rests on the assumption that collision-fiber geometry fully captures residual identity ambiguity for arbitrary finite sources. While the zero-error converse L ≥ log₂ A_π follows directly from the definition of A_π, the manuscript does not explicitly rule out additional sources of ambiguity (e.g., context-dependent or non-combinatorial) that could arise outside the fiber decomposition; this assumption is load-bearing for the system-level claim.
Authors: We agree that an explicit scoping statement strengthens the presentation. Within the paper's model—arbitrary finite sources X equipped with a fixed representation map π: X → U—the collision fibers π^{-1}(u) exhaust the identity ambiguity by definition: given only the representation value u, the possible preimages are precisely the fiber elements. All derived laws (zero-error converse, finite-block scaling, pointwise budget, and fiberwise rate-distortion via recoverable-mass decomposition) follow directly from this geometry. Context-dependent or non-combinatorial ambiguities would require an extended model (e.g., side information channels or non-finite alphabets) outside the present finite-source combinatorial framework. To make the boundary explicit, we will revise the abstract and the opening of §3 to state: 'for arbitrary finite sources under a representation map π, where residual identity ambiguity is fully captured by the collision-fiber geometry.' This is a minor clarification that does not alter any theorem or proof. revision: partial
Circularity Check
No significant circularity; derivation self-contained via formal verification
full rationale
The paper defines the collision-fiber size A_π = max_u |π^{-1}(u)| as the sole combinatorial object controlling residual ambiguity, then derives the zero-error converse L ≥ log₂ A_π, finite-block scaling, pointwise adaptive budget, and fiberwise rate-distortion law directly from recoverable-mass decomposition over arbitrary finite sources and representation maps π. All central results are machine-checked in Lean 4, providing external formal verification independent of fitted parameters or self-referential equations. The necessity of symbolic mechanisms follows as a corollary once fiber geometry is accepted as the complete description of identity ambiguity; no load-bearing step reduces the claimed laws to their inputs by construction, and no self-citation chain is invoked for the uniqueness or derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Standard axioms of finite information theory and combinatorics for rate-distortion and fiber decompositions
invented entities (1)
-
collision-fiber geometry
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
A_π := max_u |π^{-1}(u)| ... L ≥ log₂ A_π ... D^*(L) = max(0,1-2^L/a)
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
symbolic identity mechanisms ... necessary system-level complement
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
A theoretical analysis of contrastive unsupervised representation learning,
N. Saunshi, O. Plevrakis, S. Arora, M. Khodak, and N. Khandeparkar, “A theoretical analysis of contrastive unsupervised representation learning,” inProceedings of the 36th International Conference on Machine Learning (ICML), 2019, pp. 5628–5637. [Online]. Available: http://proceedings.mlr.press/v97/saunshi19a.html
work page 2019
-
[2]
β-vae: Learning basic visual concepts with a constrained variational framework,
I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, A. Lerchner, Z. Zhenget al., “β-vae: Learning basic visual concepts with a constrained variational framework,” inInternational Conference on Learning Representations (ICLR), 2017. [Online]. Available: https://openreview.net/forum?id=Hkx9Qkcg
work page 2017
-
[3]
Deep variational information bottleneck,
A. Alemi, A. Fischer, V . Soro, and P. Sontag, “Deep variational information bottleneck,” inInternational Conference on Learning Representations (ICLR), 2017. [Online]. Available: https://openreview. net/forum?id=H1g1Ylxl
work page 2017
-
[4]
Understanding dimensional collapse in contrastive self-supervised learning,
L. Jing, P. Vincent, Y . LeCun, and Y . Tian, “Understanding dimensional collapse in contrastive self-supervised learning,” inInternational Conference on Learning Representations (ICLR), 2022. [Online]. Available: https://openreview.net/forum?id=YeuYXfR2iOa 79 :NSL1-6
work page 2022
-
[5]
T. Wang and P. Isola, “Understanding contrastive representation learning through alignment and uniformity on the hypersphere,” in International Conference on Machine Learning (ICML). PMLR, 2020, pp. 9929–9939. [Online]. Available: http://proceedings.mlr.press/v119/ wang20k.html
work page 2020
-
[6]
Abduction-based explanations for machine learning models,
A. Ignatiev, N. Narodytska, and J. Marques-Silva, “Abduction-based explanations for machine learning models,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 1, 2019, pp. 1511– 1519
work page 2019
-
[7]
Delivering trustworthy AI through formal XAI,
J. Marques-Silva and A. Ignatiev, “Delivering trustworthy AI through formal XAI,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 11, 2022, pp. 12 342–12 350
work page 2022
-
[8]
C. Dwork, “Differential privacy,” inInternational Colloquium on Au- tomata, Languages, and Programming (ICALP). Springer, 2006, pp. 1–12
work page 2006
-
[9]
Noiseless coding of correlated information sources,
D. Slepian and J. Wolf, “Noiseless coding of correlated information sources,”IEEE Transactions on Information Theory, vol. 19, no. 4, pp. 471–480, 1973
work page 1973
-
[10]
The rate-distortion function for source coding with side information at the decoder,
A. D. Wyner and J. Ziv, “The rate-distortion function for source coding with side information at the decoder,”IEEE Transactions on Information Theory, vol. 22, no. 1, pp. 1–10, 1976
work page 1976
-
[11]
Source coding theory for cascade and branching com- munication systems,
H. Yamamoto, “Source coding theory for cascade and branching com- munication systems,”IEEE Transactions on Information Theory, vol. 27, no. 3, pp. 299–308, 1981
work page 1981
-
[12]
Gaussian multiterminal source coding,
Y . Oohama, “Gaussian multiterminal source coding,”IEEE Transactions on Information Theory, vol. 43, no. 6, pp. 1912–1923, 1997
work page 1912
-
[13]
The Wyner–Ziv problem with coded side information,
M. Gastpar, “The Wyner–Ziv problem with coded side information,” IEEE Transactions on Information Theory, vol. 50, no. 11, pp. 2762– 2768, 2004
work page 2004
-
[14]
The zero error capacity of a noisy channel,
C. E. Shannon, “The zero error capacity of a noisy channel,”IRE Transactions on Information Theory, vol. 2, no. 3, pp. 8–19, 1956
work page 1956
-
[15]
On the shannon capacity of a graph,
L. Lovász, “On the shannon capacity of a graph,”IEEE Transactions on Information Theory, vol. 25, no. 1, pp. 1–7, 1979
work page 1979
-
[16]
D. Gund"uz, Z. Qin, I. Estella Aguerri, H. S. Dhillon, Z. Yang, A. Yener, K.-K. Wong, and C.-B. Chae, “Beyond transmitting bits: Context, semantics, and task-oriented communications,”IEEE Journal on Selected Areas in Communications, vol. 41, no. 1, pp. 5–41, 2023
work page 2023
-
[17]
Semantic communications in networked systems: A data significance perspective,
E. Uysal, O. Kaya, A. Ephremideset al., “Semantic communications in networked systems: A data significance perspective,”IEEE Network, vol. 36, no. 4, pp. 233–240, 2022
work page 2022
-
[18]
Towards a theory of semantic communication,
J. Bao, P. Basu, M. Dean, C. Partridge, A. Swami, and W. Leland, “Towards a theory of semantic communication,” in2011 IEEE Network Science Workshop. IEEE, 2011, pp. 110–117
work page 2011
-
[19]
Semantics-empowered communication for networked intelligent systems,
M. Kountouris and N. Pappas, “Semantics-empowered communication for networked intelligent systems,”IEEE Communications Magazine, vol. 59, no. 6, pp. 96–102, 2021
work page 2021
-
[20]
The information bottleneck method
N. Tishby, F. C. Pereira, and W. Bialek, “The information bottleneck method,”arXiv preprint physics/0004057, 2000
work page internal anchor Pith review Pith/arXiv arXiv 2000
-
[21]
Rethinking lossy compression: The rate- distortion-perception tradeoff,
Y . Blau and T. Michaeli, “Rethinking lossy compression: The rate- distortion-perception tradeoff,” inProceedings of the 36th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 97, 2019, pp. 675–685. [Online]. Available: https://proceedings.mlr.press/v97/blau19a.html
work page 2019
-
[22]
On the rate-distortion-perception function,
J. Chen, L. Yu, J. Wang, W. Shi, Y . Ge, and W. Tong, “On the rate-distortion-perception function,”IEEE Journal on Selected Areas in Information Theory, vol. 3, no. 4, pp. 664–673, 2022
work page 2022
-
[23]
Output-constrained lossy source coding with application to rate-distortion-perception theory,
L. Xie, L. Li, J. Chen, and Z. Zhang, “Output-constrained lossy source coding with application to rate-distortion-perception theory,” IEEE Transactions on Communications, vol. 73, no. 3, pp. 1801–1815, 2025
work page 2025
-
[24]
Perfect graphs and graph entropy,
G. Simonyi, “Perfect graphs and graph entropy,”Combinatorica, vol. 15, no. 4, pp. 541–549, 1995
work page 1995
-
[25]
Source coding and graph entropies,
N. Alon and A. Orlitsky, “Source coding and graph entropies,”IEEE Transactions on Information Theory, vol. 42, no. 5, pp. 1329–1339, 1996
work page 1996
-
[26]
An upper bound for the shannon capacity of a graph,
C. Haemers, “An upper bound for the shannon capacity of a graph,” Colloquium Mathematicum Societatis János Bolyai, vol. 25, pp. 267– 272, 1978
work page 1978
-
[27]
The asymptotic spectrum of graphs and the shannon capacity,
J. Zuiddam, “The asymptotic spectrum of graphs and the shannon capacity,”Combinatorica, vol. 39, pp. 1173–1184, 2019
work page 2019
-
[28]
The zero-error side information problem and chromatic numbers,
H. S. Witsenhausen, “The zero-error side information problem and chromatic numbers,”IEEE Transactions on Information Theory, vol. 22, no. 5, pp. 592–593, 1976
work page 1976
-
[29]
Channel coding rate in the finite blocklength regime,
Y . Polyanskiy, H. V . Poor, and S. Verdú, “Channel coding rate in the finite blocklength regime,”IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2307–2359, 2010. 13
work page 2010
-
[30]
I. Csiszár and J. Körner,Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed. Cambridge University Press, 2011
work page 2011
-
[31]
D. J. A. Welsh,Matroid Theory. Academic Press, 1976
work page 1976
-
[32]
T. M. Cover and J. A. Thomas,Elements of Information Theory, 2nd ed. Wiley-Interscience, 2006
work page 2006
-
[33]
A. Barron, M. J. Schervish, and L. Wasserman, “The information- theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems,”Bayesian Statistics, vol. 6, pp. 27–52, 1998
work page 1998
-
[34]
Retrieval- augmented generation for knowledge-intensive nlp tasks,
P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschelet al., “Retrieval- augmented generation for knowledge-intensive nlp tasks,” inAdvances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020, pp. 9459–9474. [Online]. Available: https://proceedings.neurips.cc/paper/ 2020/hash/6b4...
work page 2020
-
[35]
A. Orlitsky and J. R. Roche, “Coding for computing,”IEEE Transactions on Information Theory, vol. 47, no. 3, pp. 903–917, 2001
work page 2001
-
[36]
Functional compression through graph coloring,
V . Doshi, D. Shah, M. Médard, and M. Effros, “Functional compression through graph coloring,”IEEE Transactions on Information Theory, vol. 56, no. 8, pp. 3901–3917, 2010
work page 2010
-
[37]
R. Ahlswede and G. Dueck, “Identification via channels,”IEEE Trans- actions on Information Theory, vol. 35, no. 1, pp. 15–29, 1989
work page 1989
-
[38]
Coding of an information source having ambiguous alphabet and the entropy of graphs,
J. Körner, “Coding of an information source having ambiguous alphabet and the entropy of graphs,”6th Prague Conference on Information Theory, pp. 411–425, 1973
work page 1973
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.