pith. sign in

arxiv: 2603.25306 · v2 · submitted 2026-03-26 · 💻 cs.DB

JSON Schema Inclusion through Refutational Normalization: Reconciling Efficiency and Completeness

Pith reviewed 2026-05-15 00:55 UTC · model grok-4.3

classification 💻 cs.DB
keywords JSON Schemaschema inclusionrefutational normalizationwitness generationcompletenessefficiencyschema compatibility
0
0 comments X

The pith

Refutational normalization reconciles efficiency and completeness for JSON Schema inclusion checks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets the problem of checking JSON Schema inclusion, that is, whether every document valid under one schema is also valid under another. Rule-based methods run quickly but miss some inclusions, while witness-generation methods are complete yet often too slow for practical use. The authors enrich witness generation with a specialized normalization step that refutes potential counterexamples early. This hybrid approach delivers both the speed of incomplete methods and the guarantees of complete ones. Experiments on real-world and synthetic schemas confirm large efficiency gains and enable previously intractable checks.

Core claim

By enriching witness-generation techniques with refutational normalization, the method achieves both the efficiency of rule-based procedures and the completeness of witness generation for determining JSON Schema inclusion.

What carries the argument

Refutational normalization, a specialized normalization process integrated into witness generation that refutes potential mismatches to retain completeness while improving runtime performance.

If this is right

  • Inclusion checks for schemas too complex for existing complete tools become feasible in practice.
  • Tasks such as API version compatibility, schema refactoring, and large-scale corpus analysis gain reliable automation.
  • Experiments show the method advances the state of the art by combining speed and completeness on both real and synthetic inputs.
  • The technique supports use cases previously limited by either incompleteness or excessive runtime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same normalization idea could transfer to inclusion problems in other schema languages such as XML Schema.
  • Adoption in schema-management tools would accelerate workflows involving data migration and validation.
  • Further tuning for common real-world schema patterns could yield additional speedups.

Load-bearing premise

Refutational normalization preserves the completeness of witness generation while producing substantial efficiency gains across real and synthetic schemas.

What would settle it

A pair of JSON schemas on which the new method either reports an incorrect inclusion result or fails to run substantially faster than prior complete witness-generation algorithms.

Figures

Figures reproduced from arXiv: 2603.25306 by Carlo Sartiani, Dario Colazzo, Giorgio Ghelli, Mohamed-Amine Baazizi, Nour El Houda Ben Ali, Stefanie Scherzinger, Stefan Klessinger.

Figure 1
Figure 1. Figure 1: Performance comparison of rule-based (RB), [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Structural rules [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Grammar of canonical conjunctions. presented in Section 4.1 : a canonical conjunction𝐶 is either a type￾set conjunction 𝐶𝑇𝑆 or a typed conjunction (𝐶𝑂, 𝐶𝐴, etc.) of strat￾ified typed operators. C-references. The normalization process creates new references when it merges related operators. For example, assume we normal￾ize the following (𝑆, 𝐸) document: ( all[ all[type(object), pProp( ⌊𝑎⌋ : ref(𝑥)) ], all[… view at source ↗
Figure 4
Figure 4. Figure 4: notPush(S). Function allCS(𝐶, 𝑆, 𝐸) works by cases on 𝑆. It just applies dis￾tributivity when 𝑆 = any[. . .] (lines 8 - 11) [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Failure rates of tools across datasets (lower is bet [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
read the original abstract

JSON Schema is the de facto standard for describing the structure of JSON documents. Reasoning about JSON Schema inclusion -- whether every instance satisfying a schema S1 also satisfies a schema S2 -- is a key building block for a variety of tasks, including version and API compatibility checks, schema refactoring tools, and large-scale schema corpus analysis. Existing approaches fall into two families: rule-based algorithms that are efficient but incomplete and witness generation-based algorithms that are complete but oftentimes extremely slow. This paper introduces a new approach that reconciles the efficiency of rule-based procedures with the completeness of the witness-generation technique, by enriching the latter with a specialized form of normalization. This refutational normalization paves the way for use-cases that are too hard for current tools. Our experiments with real-world and synthetic schemas show that the refutational normalization greatly advances the state-of-the-art in JSON Schema inclusion checking.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces refutational normalization as a preprocessing step applied before witness generation for deciding JSON Schema inclusion (whether every instance of schema S1 satisfies S2). It claims this reconciles the efficiency of rule-based methods with the completeness of witness generation, with experiments on real-world and synthetic schemas showing substantial runtime improvements over prior approaches.

Significance. If the normalization rules preserve the exact set of counterexamples for non-inclusion, the technique would make complete inclusion checking practical for complex schemas, directly benefiting schema compatibility analysis, API versioning, and large-scale corpus studies where current tools are either incomplete or too slow.

major comments (2)
  1. The central claim requires that refutational normalization preserves completeness, i.e., if S1 ⊈ S2 then the normalized pair still admits a witness of non-inclusion. No reduction to a known complete procedure, inductive argument over JSON Schema constructs (e.g., anyOf combined with additionalProperties), or semantic-preservation proof is supplied, leaving the completeness guarantee unestablished.
  2. Experimental evaluation reports speed-ups but supplies no quantitative verification that completeness was retained (e.g., no count of cases where a witness was found post-normalization that would have been missed by rule-based methods alone, and no baseline comparison details on the witness-generation implementation used).
minor comments (2)
  1. The abstract asserts 'greatly advances the state-of-the-art' without citing specific speed-up factors or schema sizes; adding one or two concrete numbers would strengthen the claim.
  2. Normalization rules should be presented with explicit formal definitions (e.g., as rewrite rules with preconditions) rather than informal descriptions to support reproducibility and future proof attempts.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We will revise the manuscript to supply the missing formal argument for completeness preservation and to strengthen the experimental section with quantitative completeness checks.

read point-by-point responses
  1. Referee: The central claim requires that refutational normalization preserves completeness, i.e., if S1 ⊈ S2 then the normalized pair still admits a witness of non-inclusion. No reduction to a known complete procedure, inductive argument over JSON Schema constructs (e.g., anyOf combined with additionalProperties), or semantic-preservation proof is supplied, leaving the completeness guarantee unestablished.

    Authors: We agree that the current manuscript lacks an explicit completeness argument. In the revision we will add a semantic-preservation proof by induction on JSON Schema constructs. The argument will show that each normalization rule preserves the existence of counterexample witnesses (when they exist) and will explicitly treat combinations such as anyOf with additionalProperties by demonstrating that no valid counterexample is eliminated. revision: yes

  2. Referee: Experimental evaluation reports speed-ups but supplies no quantitative verification that completeness was retained (e.g., no count of cases where a witness was found post-normalization that would have been missed by rule-based methods alone, and no baseline comparison details on the witness-generation implementation used).

    Authors: We acknowledge that the experiments do not yet report quantitative completeness verification. In the revised version we will add tables counting the additional witnesses discovered after normalization (i.e., cases where rule-based methods alone would have failed to detect non-inclusion) and will document the exact witness-generation implementation, its configuration parameters, and how it was used as the complete baseline. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The provided abstract and description introduce refutational normalization as a technique to reconcile efficiency and completeness for JSON Schema inclusion checking. No equations, self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations are visible. The central claim rests on experimental speed-ups and an asserted preservation property, but the text supplies no derivation that reduces the result to its own inputs by construction. This is a standard case of an independent algorithmic contribution without circular structure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5476 in / 941 out tokens · 35863 ms · 2026-05-15T00:55:50.437160+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages

  1. [1]

    JSON schema validator

    2026. JSON schema validator. https://github.com/networknt/json-schema- validator Retrieved 12 February 2026

  2. [2]

    jsonschema

    2026. jsonschema. https://github.com/python-jsonschema/jsonschema Re- trieved 12 February 2026

  3. [3]

    Lyes Attouche, Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Stefan Klessinger, Carlo Sartiani, and Stefanie Scherzinger. 2026. Elimination of anno- tation dependencies in validation for Modern JSON Schema. Theor. Comput. Sci. 1063 (2026), 115645. https://doi.org/10.1016/J.TCS.2025.115645

  4. [4]

    Lyes Attouche, Mohamed Amine Baazizi, Dario Colazzo, Yunchen Ding, Michael Fruth, Giorgio Ghelli, Carlo Sartiani, and Stefanie Scherzinger. 2021. A Test Suite for JSON Schema Containment. In Proc. ER 2021. 19–24

  5. [5]

    Lyes Attouche, Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Ste- fan Klessinger, Carlo Sartiani, and Stefanie Scherzinger. 2022. Repro- duction Package. Available on Zenodo at https://doi.org/10.5281/ zenodo.7106750 and maintained on GitHub at https://github.com/sdbs-uni-p/ JSONSchemaWitnessGeneration

  6. [6]

    Lyes Attouche, Mohamed Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and Stefanie Scherzinger. 2022. Witness Generation for JSON Schema. JSON Schema Inclusion through Refutational Normalization: Reconciling Efficiency and Completeness Proc. VLDB Endow. 15, 13 (2022), 4002–4014. https://www.vldb.org/pvldb/vol15/ p4002-sartiani.pdf

  7. [7]

    Lyes Attouche, Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and Stefanie Scherzinger. 2023. Reproduction Package for: Validation of Modern JSON Schema: Formalization and Complexity . https://doi.org/10.5281/ zenodo.10019663

  8. [8]

    Lyes Attouche, Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and Stefanie Scherzinger. 2026. Witness Generation for Classical JSON Schema. ACM Trans. Database Syst. (Feb. 2026). https://doi.org/10.1145/3799416 Just Accepted

  9. [9]

    McGuinness, Daniele Nardi, and Pe- ter F

    Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Pe- ter F. Patel-Schneider (Eds.). 2003. The Description Logic Handbook: Theory, Im- plementation, and Applications. Cambridge University Press

  10. [10]

    Franz Baader and Ulrike Sattler. 2001. An Overview of Tableau Algorithms for Description Logics. Stud Logica 69, 1 (2001), 5–40. https://doi.org/10.1023/A: 1013882326814

  11. [11]

    Mohamed Amine Baazizi, Clément Berti, Dario Colazzo, Giorgio Ghelli, and Carlo Sartiani. 2020. Human-in-the-Loop Schema Inference for Massive JSON Datasets. In EDBT. OpenProceedings.org, 635–638

  12. [12]

    Mohamed Amine Baazizi, Dario Colazzo, Giorgio Ghelli, and Carlo Sartiani. 2019. Parametric schema inference for massive JSON datasets. VLDB J. 28, 4 (2019), 497–521

  13. [13]

    Scalable Funding of Bitcoin Micropayment Channel Networks

    Mohamed Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and Ste- fanie Scherzinger. 2021. An Empirical Study on the "Usage of Not" in Real-World JSON Schema Documents. In Proc. ER. 102–112. https://doi.org/10.1007/978-3- 030-89022-3_9

  14. [14]

    Mohamed Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and Ste- fanie Scherzinger. 2023. Negation-closure for JSON Schema. Theor. Comput. Sci. 955 (2023), 113823. https://doi.org/10.1016/j.tcs.2023.113823

  15. [15]

    Fernando Suárez Barría. 2016. Formal Specification, Expressiveness, and Complex- ity Analysis for JSON Schema . Master’s thesis. Pontificia Universidad Católica de Chile, Santiago, Chile. https://repositorio.uc.cl/handle/11534/16908

  16. [16]

    Guillaume Baudart, Martin Hirzel, Kiran Kate, Parikshit Ram, and Avraham Shin- nar. 2020. Lale: Consistent Automated Machine Learning. CoRR abs/2007.01977 (2020). arXiv:2007.01977 https://arxiv.org/abs/2007.01977

  17. [17]

    Filippo Bonchi and Damien Pous. 2013. Checking NFA equivalence with bisim- ulations up to congruence. In The 40th Annual ACM SIGPLAN-SIGACT Sympo- sium on Principles of Programming Languages, POPL ’13 . ACM, 457–468. https: //doi.org/10.1145/2429069.2429124

  18. [18]

    Ahmed Bouajjani, Peter Habermehl, Lukás Holík, Tayssir Touili, and Tomás Vo- jnar. 2008. Antichain-Based Universality and Inclusion Testing over Nondeter- ministic Finite Tree Automata. In Implementation and Applications of Automata, 13th International Conference, CIAA 2008 (Lecture Notes in Computer Science) . Springer, 57–67. https://doi.org/10.1007/978...

  19. [19]

    Reutter, Fernando Suárez, and Domagoj Vrgoč

    Pierre Bourhis, Juan L. Reutter, Fernando Suárez, and Domagoj Vrgoc. 2017. JSON: Data model, Query languages and Schema specification. In Proc. PODS . 123–135. https://doi.org/10.1145/3034786.3056120

  20. [20]

    Giuseppe Castagna and Alain Frisch. 2005. A gentle introduction to semantic subtyping. In Proceedings of the 7th International ACM SIGPLAN Conference on Principles and Practice of Declarative Programming . ACM, 198–199. https://doi. org/10.1145/1069774.1069793

  21. [21]

    Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen

    Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen. 2024. XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models. CoRR abs/2411.15100 (2024). https://doi. org/10.48550/ARXIV.2411.15100 arXiv:2411.15100

  22. [22]

    Alain Frisch. 2004. Théorie, conception et réalisation d’un langage de program- mation fonctionnel adapté à XML. Ph. D. Dissertation. Université Paris 7 – Denis Diderot

  23. [23]

    Alain Frisch, Giuseppe Castagna, and Véronique Benzaken. 2002. Semantic Sub- typing. In 17th IEEE Symposium on Logic in Computer Science (LICS 2002), . 137–

  24. [24]

    https://doi.org/10.1109/LICS.2002.1029823

  25. [25]

    Alain Frisch, Giuseppe Castagna, and Véronique Benzaken. 2008. Semantic sub- typing: Dealing set-theoretically with function, union, intersection, and nega- tion types. J. ACM 55, 4 (2008), 19:1–19:64. https://doi.org/10.1145/1391289. 1391293

  26. [26]

    Pierre Genevès, Nabil Layaïda, and Alan Schmitt. 2007. Efficient static analysis of XML paths and types. InProceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation, San Diego, California, USA, June 10-13, 2007, Jeanne Ferrante and Kathryn S. McKinley (Eds.). ACM, 342–351. https://doi.org/10.1145/1250734.1250773

  27. [27]

    Li, and Konstantinos Mamouras

    Alexis Le Glaunec, Angela W. Li, and Konstantinos Mamouras. 2026. Streaming Validation of JSON Documents against Schemas.Proc. VLDB Endow.19, 3 (March 2026), 509–522. https://doi.org/10.14778/3778092.3778109

  28. [28]

    Andrew Habib, Avraham Shinnar, Martin Hirzel, and Michael Pradel. 2021. Find- ing Data Compatibility Bugs with JSON Subschema Checking. In Proc. ISSTA. 620–632. https://doi.org/10.1145/3460319.3464796

  29. [29]

    Patel-Schneider

    Ian Horrocks and Peter F. Patel-Schneider. 1999. Optimizing Description Logic Subsumption. J. Log. Comput. 9, 3 (1999), 267–293. https://doi.org/10.1093/ LOGCOM/9.3.267

  30. [30]

    Haruo Hosoya. 2001. Regular Expression Types for XML. Ph. D. Dissertation. The University of Tokyo

  31. [31]

    Haruo Hosoya and Makoto Murata. 2003. Boolean Operations for Attribute- Element Constraints. In Implementation and Application of Automata, 8th Inter- national Conference, CIAA 2003, Santa Barbara, California, USA, July 16-18, 2003, Proceedings (Lecture Notes in Computer Science) , Oscar H. Ibarra and Zhe Dang (Eds.). Springer, 201–212. https://doi.org/10...

  32. [32]

    Haruo Hosoya and Benjamin C. Pierce. 2001. Regular expression pattern match- ing for XML. In Conference Record of POPL 2001: The 28th ACM SIGPLAN- SIGACT Symposium on Principles of Programming Languages, London, UK, Jan- uary 17-19, 2001 , Chris Hankin and Dave Schmidt (Eds.). ACM, 67–80. https: //doi.org/10.1145/360204.360209

  33. [33]

    Haruo Hosoya and Benjamin C. Pierce. 2003. XDuce: A statically typed XML processing language. ACM Trans. Internet Techn. 3, 2 (2003), 117–148. https: //doi.org/10.1145/767193.767195

  34. [34]

    Haruo Hosoya, Jerome Vouillon, and Benjamin C. Pierce. 2005. Regular ex- pression types for XML. ACM Trans. Program. Lang. Syst. 27, 1 (2005), 46–90. https://doi.org/10.1145/1053468.1053470

  35. [35]

    Stefan Klessinger, Michael Fruth, Valentin Gittinger, Meike Klettke, Uta Störl, and Stefanie Scherzinger. 2023. Tagger: A Tool for the Discovery of Tagged Unions in JSON Schema Extraction. InProceedings 26th International Conference on Extending Database Technology, EDBT 2023, Ioannina, Greece, March 28-31,

  36. [36]

    https://doi.org/10.48786/EDBT.2023.75

    OpenProceedings.org, 827–830. https://doi.org/10.48786/EDBT.2023.75

  37. [37]

    Meike Klettke, Uta Störl, and Stefanie Scherzinger. 2015. Schema Extraction and Structural Outlier Detection for JSON-based NoSQL Data Stores. In Proc. BTW (LNI, Vol. P-241). 425–444

  38. [38]

    Yu Liu, Duantengchuan Li, Kaili Wang, Zhuoran Xiong, Fobo Shi, Jian Wang, Bing Li, and Bo Hang. 2024. Are LLMs good at structured outputs? A benchmark for evaluating structured output capabilities in LLMs. Inf. Process. Manag. 61, 5 (2024), 103809. https://doi.org/10.1016/J.IPM.2024.103809

  39. [39]

    Reutter, Fernando Suárez, Martín Ugarte, and Domagoj Vrgoc

    Felipe Pezoa, Juan L. Reutter, Fernando Suárez, Martín Ugarte, and Domagoj Vrgoc. 2016. Foundations of JSON Schema. In Proc. WWW. 263–273

  40. [40]

    William Spoth, Oliver Kennedy, Ying Lu, Beda Christoph Hammerschmidt, and Zhen Hua Liu. 2021. Reducing Ambiguity in Json Schema Discovery. InSIGMOD Conference. ACM, 1732–1744

  41. [41]

    Tadahiro Suda and Haruo Hosoya. 2005. Non-backtracking Top-Down Algo- rithm for Checking Tree Automata Containment. In Implementation and Appli- cation of Automata, 10th International Conference, CIAA 2005, Sophia Antipolis, France, June 27-29, 2005, Revised Selected Papers (Lecture Notes in Computer Sci- ence), Jacques Farré, Igor Litovsky, and Sylvain S...

  42. [42]

    Henzinger, and Jean-François Raskin

    Martin De Wulf, Laurent Doyen, Thomas A. Henzinger, and Jean-François Raskin. 2006. Antichains: A New Algorithm for Checking Universality of Fi- nite Automata. In Computer Aided Verification, 18th International Conference, CA V 2006, Seattle, W A, USA, August 17-20, 2006, Proceedings (Lecture Notes in Computer Science) , Thomas Ball and Robert B. Jones (E...

  43. [43]

    Joohyung Yun, Byungchul Tak, and Wook-Shin Han. 2024. ReCG: Bottom-up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework. Proc. VLDB Endow.17, 11 (July 2024), 3538–3550. https://doi.org/10.14778/3681954. 3682019