JSON Schema Inclusion through Refutational Normalization: Reconciling Efficiency and Completeness
Pith reviewed 2026-05-15 00:55 UTC · model grok-4.3
The pith
Refutational normalization reconciles efficiency and completeness for JSON Schema inclusion checks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By enriching witness-generation techniques with refutational normalization, the method achieves both the efficiency of rule-based procedures and the completeness of witness generation for determining JSON Schema inclusion.
What carries the argument
Refutational normalization, a specialized normalization process integrated into witness generation that refutes potential mismatches to retain completeness while improving runtime performance.
If this is right
- Inclusion checks for schemas too complex for existing complete tools become feasible in practice.
- Tasks such as API version compatibility, schema refactoring, and large-scale corpus analysis gain reliable automation.
- Experiments show the method advances the state of the art by combining speed and completeness on both real and synthetic inputs.
- The technique supports use cases previously limited by either incompleteness or excessive runtime.
Where Pith is reading between the lines
- The same normalization idea could transfer to inclusion problems in other schema languages such as XML Schema.
- Adoption in schema-management tools would accelerate workflows involving data migration and validation.
- Further tuning for common real-world schema patterns could yield additional speedups.
Load-bearing premise
Refutational normalization preserves the completeness of witness generation while producing substantial efficiency gains across real and synthetic schemas.
What would settle it
A pair of JSON schemas on which the new method either reports an incorrect inclusion result or fails to run substantially faster than prior complete witness-generation algorithms.
Figures
read the original abstract
JSON Schema is the de facto standard for describing the structure of JSON documents. Reasoning about JSON Schema inclusion -- whether every instance satisfying a schema S1 also satisfies a schema S2 -- is a key building block for a variety of tasks, including version and API compatibility checks, schema refactoring tools, and large-scale schema corpus analysis. Existing approaches fall into two families: rule-based algorithms that are efficient but incomplete and witness generation-based algorithms that are complete but oftentimes extremely slow. This paper introduces a new approach that reconciles the efficiency of rule-based procedures with the completeness of the witness-generation technique, by enriching the latter with a specialized form of normalization. This refutational normalization paves the way for use-cases that are too hard for current tools. Our experiments with real-world and synthetic schemas show that the refutational normalization greatly advances the state-of-the-art in JSON Schema inclusion checking.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces refutational normalization as a preprocessing step applied before witness generation for deciding JSON Schema inclusion (whether every instance of schema S1 satisfies S2). It claims this reconciles the efficiency of rule-based methods with the completeness of witness generation, with experiments on real-world and synthetic schemas showing substantial runtime improvements over prior approaches.
Significance. If the normalization rules preserve the exact set of counterexamples for non-inclusion, the technique would make complete inclusion checking practical for complex schemas, directly benefiting schema compatibility analysis, API versioning, and large-scale corpus studies where current tools are either incomplete or too slow.
major comments (2)
- The central claim requires that refutational normalization preserves completeness, i.e., if S1 ⊈ S2 then the normalized pair still admits a witness of non-inclusion. No reduction to a known complete procedure, inductive argument over JSON Schema constructs (e.g., anyOf combined with additionalProperties), or semantic-preservation proof is supplied, leaving the completeness guarantee unestablished.
- Experimental evaluation reports speed-ups but supplies no quantitative verification that completeness was retained (e.g., no count of cases where a witness was found post-normalization that would have been missed by rule-based methods alone, and no baseline comparison details on the witness-generation implementation used).
minor comments (2)
- The abstract asserts 'greatly advances the state-of-the-art' without citing specific speed-up factors or schema sizes; adding one or two concrete numbers would strengthen the claim.
- Normalization rules should be presented with explicit formal definitions (e.g., as rewrite rules with preconditions) rather than informal descriptions to support reproducibility and future proof attempts.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback. We will revise the manuscript to supply the missing formal argument for completeness preservation and to strengthen the experimental section with quantitative completeness checks.
read point-by-point responses
-
Referee: The central claim requires that refutational normalization preserves completeness, i.e., if S1 ⊈ S2 then the normalized pair still admits a witness of non-inclusion. No reduction to a known complete procedure, inductive argument over JSON Schema constructs (e.g., anyOf combined with additionalProperties), or semantic-preservation proof is supplied, leaving the completeness guarantee unestablished.
Authors: We agree that the current manuscript lacks an explicit completeness argument. In the revision we will add a semantic-preservation proof by induction on JSON Schema constructs. The argument will show that each normalization rule preserves the existence of counterexample witnesses (when they exist) and will explicitly treat combinations such as anyOf with additionalProperties by demonstrating that no valid counterexample is eliminated. revision: yes
-
Referee: Experimental evaluation reports speed-ups but supplies no quantitative verification that completeness was retained (e.g., no count of cases where a witness was found post-normalization that would have been missed by rule-based methods alone, and no baseline comparison details on the witness-generation implementation used).
Authors: We acknowledge that the experiments do not yet report quantitative completeness verification. In the revised version we will add tables counting the additional witnesses discovered after normalization (i.e., cases where rule-based methods alone would have failed to detect non-inclusion) and will document the exact witness-generation implementation, its configuration parameters, and how it was used as the complete baseline. revision: yes
Circularity Check
No circularity in derivation chain
full rationale
The provided abstract and description introduce refutational normalization as a technique to reconcile efficiency and completeness for JSON Schema inclusion checking. No equations, self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations are visible. The central claim rests on experimental speed-ups and an asserted preservation property, but the text supplies no derivation that reduces the result to its own inputs by construction. This is a standard case of an independent algorithmic contribution without circular structure.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We introduce a notion of refutational normalization, an approach to normalization that internalizes the efficient behavior of the inclusion rules, that is optimized for a fast detection of situations where the normalized term is not satisfiable
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanabsolute_floor_iff_bare_distinguishability unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
DNF(all[S1, not(S2)]) ... lazy normalization ... fast complement-absorption ... eager reference-evaluation
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
2026. JSON schema validator. https://github.com/networknt/json-schema- validator Retrieved 12 February 2026
work page 2026
-
[2]
2026. jsonschema. https://github.com/python-jsonschema/jsonschema Re- trieved 12 February 2026
work page 2026
-
[3]
Lyes Attouche, Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Stefan Klessinger, Carlo Sartiani, and Stefanie Scherzinger. 2026. Elimination of anno- tation dependencies in validation for Modern JSON Schema. Theor. Comput. Sci. 1063 (2026), 115645. https://doi.org/10.1016/J.TCS.2025.115645
-
[4]
Lyes Attouche, Mohamed Amine Baazizi, Dario Colazzo, Yunchen Ding, Michael Fruth, Giorgio Ghelli, Carlo Sartiani, and Stefanie Scherzinger. 2021. A Test Suite for JSON Schema Containment. In Proc. ER 2021. 19–24
work page 2021
-
[5]
Lyes Attouche, Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Ste- fan Klessinger, Carlo Sartiani, and Stefanie Scherzinger. 2022. Repro- duction Package. Available on Zenodo at https://doi.org/10.5281/ zenodo.7106750 and maintained on GitHub at https://github.com/sdbs-uni-p/ JSONSchemaWitnessGeneration
work page 2022
-
[6]
Lyes Attouche, Mohamed Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and Stefanie Scherzinger. 2022. Witness Generation for JSON Schema. JSON Schema Inclusion through Refutational Normalization: Reconciling Efficiency and Completeness Proc. VLDB Endow. 15, 13 (2022), 4002–4014. https://www.vldb.org/pvldb/vol15/ p4002-sartiani.pdf
work page 2022
-
[7]
Lyes Attouche, Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and Stefanie Scherzinger. 2023. Reproduction Package for: Validation of Modern JSON Schema: Formalization and Complexity . https://doi.org/10.5281/ zenodo.10019663
work page 2023
-
[8]
Lyes Attouche, Mohamed-Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and Stefanie Scherzinger. 2026. Witness Generation for Classical JSON Schema. ACM Trans. Database Syst. (Feb. 2026). https://doi.org/10.1145/3799416 Just Accepted
-
[9]
McGuinness, Daniele Nardi, and Pe- ter F
Franz Baader, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi, and Pe- ter F. Patel-Schneider (Eds.). 2003. The Description Logic Handbook: Theory, Im- plementation, and Applications. Cambridge University Press
work page 2003
-
[10]
Franz Baader and Ulrike Sattler. 2001. An Overview of Tableau Algorithms for Description Logics. Stud Logica 69, 1 (2001), 5–40. https://doi.org/10.1023/A: 1013882326814
work page doi:10.1023/a: 2001
-
[11]
Mohamed Amine Baazizi, Clément Berti, Dario Colazzo, Giorgio Ghelli, and Carlo Sartiani. 2020. Human-in-the-Loop Schema Inference for Massive JSON Datasets. In EDBT. OpenProceedings.org, 635–638
work page 2020
-
[12]
Mohamed Amine Baazizi, Dario Colazzo, Giorgio Ghelli, and Carlo Sartiani. 2019. Parametric schema inference for massive JSON datasets. VLDB J. 28, 4 (2019), 497–521
work page 2019
-
[13]
Scalable Funding of Bitcoin Micropayment Channel Networks
Mohamed Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and Ste- fanie Scherzinger. 2021. An Empirical Study on the "Usage of Not" in Real-World JSON Schema Documents. In Proc. ER. 102–112. https://doi.org/10.1007/978-3- 030-89022-3_9
-
[14]
Mohamed Amine Baazizi, Dario Colazzo, Giorgio Ghelli, Carlo Sartiani, and Ste- fanie Scherzinger. 2023. Negation-closure for JSON Schema. Theor. Comput. Sci. 955 (2023), 113823. https://doi.org/10.1016/j.tcs.2023.113823
-
[15]
Fernando Suárez Barría. 2016. Formal Specification, Expressiveness, and Complex- ity Analysis for JSON Schema . Master’s thesis. Pontificia Universidad Católica de Chile, Santiago, Chile. https://repositorio.uc.cl/handle/11534/16908
work page 2016
- [16]
-
[17]
Filippo Bonchi and Damien Pous. 2013. Checking NFA equivalence with bisim- ulations up to congruence. In The 40th Annual ACM SIGPLAN-SIGACT Sympo- sium on Principles of Programming Languages, POPL ’13 . ACM, 457–468. https: //doi.org/10.1145/2429069.2429124
-
[18]
Ahmed Bouajjani, Peter Habermehl, Lukás Holík, Tayssir Touili, and Tomás Vo- jnar. 2008. Antichain-Based Universality and Inclusion Testing over Nondeter- ministic Finite Tree Automata. In Implementation and Applications of Automata, 13th International Conference, CIAA 2008 (Lecture Notes in Computer Science) . Springer, 57–67. https://doi.org/10.1007/978...
-
[19]
Reutter, Fernando Suárez, and Domagoj Vrgoč
Pierre Bourhis, Juan L. Reutter, Fernando Suárez, and Domagoj Vrgoc. 2017. JSON: Data model, Query languages and Schema specification. In Proc. PODS . 123–135. https://doi.org/10.1145/3034786.3056120
-
[20]
Giuseppe Castagna and Alain Frisch. 2005. A gentle introduction to semantic subtyping. In Proceedings of the 7th International ACM SIGPLAN Conference on Principles and Practice of Declarative Programming . ACM, 198–199. https://doi. org/10.1145/1069774.1069793
-
[21]
Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen
Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen. 2024. XGrammar: Flexible and Efficient Structured Generation Engine for Large Language Models. CoRR abs/2411.15100 (2024). https://doi. org/10.48550/ARXIV.2411.15100 arXiv:2411.15100
-
[22]
Alain Frisch. 2004. Théorie, conception et réalisation d’un langage de program- mation fonctionnel adapté à XML. Ph. D. Dissertation. Université Paris 7 – Denis Diderot
work page 2004
-
[23]
Alain Frisch, Giuseppe Castagna, and Véronique Benzaken. 2002. Semantic Sub- typing. In 17th IEEE Symposium on Logic in Computer Science (LICS 2002), . 137–
work page 2002
-
[24]
https://doi.org/10.1109/LICS.2002.1029823
-
[25]
Alain Frisch, Giuseppe Castagna, and Véronique Benzaken. 2008. Semantic sub- typing: Dealing set-theoretically with function, union, intersection, and nega- tion types. J. ACM 55, 4 (2008), 19:1–19:64. https://doi.org/10.1145/1391289. 1391293
-
[26]
Pierre Genevès, Nabil Layaïda, and Alan Schmitt. 2007. Efficient static analysis of XML paths and types. InProceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation, San Diego, California, USA, June 10-13, 2007, Jeanne Ferrante and Kathryn S. McKinley (Eds.). ACM, 342–351. https://doi.org/10.1145/1250734.1250773
-
[27]
Alexis Le Glaunec, Angela W. Li, and Konstantinos Mamouras. 2026. Streaming Validation of JSON Documents against Schemas.Proc. VLDB Endow.19, 3 (March 2026), 509–522. https://doi.org/10.14778/3778092.3778109
-
[28]
Andrew Habib, Avraham Shinnar, Martin Hirzel, and Michael Pradel. 2021. Find- ing Data Compatibility Bugs with JSON Subschema Checking. In Proc. ISSTA. 620–632. https://doi.org/10.1145/3460319.3464796
-
[29]
Ian Horrocks and Peter F. Patel-Schneider. 1999. Optimizing Description Logic Subsumption. J. Log. Comput. 9, 3 (1999), 267–293. https://doi.org/10.1093/ LOGCOM/9.3.267
work page 1999
-
[30]
Haruo Hosoya. 2001. Regular Expression Types for XML. Ph. D. Dissertation. The University of Tokyo
work page 2001
-
[31]
Haruo Hosoya and Makoto Murata. 2003. Boolean Operations for Attribute- Element Constraints. In Implementation and Application of Automata, 8th Inter- national Conference, CIAA 2003, Santa Barbara, California, USA, July 16-18, 2003, Proceedings (Lecture Notes in Computer Science) , Oscar H. Ibarra and Zhe Dang (Eds.). Springer, 201–212. https://doi.org/10...
-
[32]
Haruo Hosoya and Benjamin C. Pierce. 2001. Regular expression pattern match- ing for XML. In Conference Record of POPL 2001: The 28th ACM SIGPLAN- SIGACT Symposium on Principles of Programming Languages, London, UK, Jan- uary 17-19, 2001 , Chris Hankin and Dave Schmidt (Eds.). ACM, 67–80. https: //doi.org/10.1145/360204.360209
-
[33]
Haruo Hosoya and Benjamin C. Pierce. 2003. XDuce: A statically typed XML processing language. ACM Trans. Internet Techn. 3, 2 (2003), 117–148. https: //doi.org/10.1145/767193.767195
-
[34]
Haruo Hosoya, Jerome Vouillon, and Benjamin C. Pierce. 2005. Regular ex- pression types for XML. ACM Trans. Program. Lang. Syst. 27, 1 (2005), 46–90. https://doi.org/10.1145/1053468.1053470
-
[35]
Stefan Klessinger, Michael Fruth, Valentin Gittinger, Meike Klettke, Uta Störl, and Stefanie Scherzinger. 2023. Tagger: A Tool for the Discovery of Tagged Unions in JSON Schema Extraction. InProceedings 26th International Conference on Extending Database Technology, EDBT 2023, Ioannina, Greece, March 28-31,
work page 2023
-
[36]
https://doi.org/10.48786/EDBT.2023.75
OpenProceedings.org, 827–830. https://doi.org/10.48786/EDBT.2023.75
-
[37]
Meike Klettke, Uta Störl, and Stefanie Scherzinger. 2015. Schema Extraction and Structural Outlier Detection for JSON-based NoSQL Data Stores. In Proc. BTW (LNI, Vol. P-241). 425–444
work page 2015
-
[38]
Yu Liu, Duantengchuan Li, Kaili Wang, Zhuoran Xiong, Fobo Shi, Jian Wang, Bing Li, and Bo Hang. 2024. Are LLMs good at structured outputs? A benchmark for evaluating structured output capabilities in LLMs. Inf. Process. Manag. 61, 5 (2024), 103809. https://doi.org/10.1016/J.IPM.2024.103809
-
[39]
Reutter, Fernando Suárez, Martín Ugarte, and Domagoj Vrgoc
Felipe Pezoa, Juan L. Reutter, Fernando Suárez, Martín Ugarte, and Domagoj Vrgoc. 2016. Foundations of JSON Schema. In Proc. WWW. 263–273
work page 2016
-
[40]
William Spoth, Oliver Kennedy, Ying Lu, Beda Christoph Hammerschmidt, and Zhen Hua Liu. 2021. Reducing Ambiguity in Json Schema Discovery. InSIGMOD Conference. ACM, 1732–1744
work page 2021
-
[41]
Tadahiro Suda and Haruo Hosoya. 2005. Non-backtracking Top-Down Algo- rithm for Checking Tree Automata Containment. In Implementation and Appli- cation of Automata, 10th International Conference, CIAA 2005, Sophia Antipolis, France, June 27-29, 2005, Revised Selected Papers (Lecture Notes in Computer Sci- ence), Jacques Farré, Igor Litovsky, and Sylvain S...
-
[42]
Henzinger, and Jean-François Raskin
Martin De Wulf, Laurent Doyen, Thomas A. Henzinger, and Jean-François Raskin. 2006. Antichains: A New Algorithm for Checking Universality of Fi- nite Automata. In Computer Aided Verification, 18th International Conference, CA V 2006, Seattle, W A, USA, August 17-20, 2006, Proceedings (Lecture Notes in Computer Science) , Thomas Ball and Robert B. Jones (E...
-
[43]
Joohyung Yun, Byungchul Tak, and Wook-Shin Han. 2024. ReCG: Bottom-up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework. Proc. VLDB Endow.17, 11 (July 2024), 3538–3550. https://doi.org/10.14778/3681954. 3682019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.