Semantic Layers for Reliable LLM-Powered Data Analytics: A Paired Benchmark of Accuracy and Hallucination Across Three Frontier Models
Pith reviewed 2026-05-07 16:37 UTC · model grok-4.3
The pith
Supplying a compact semantic-layer document lifts accuracy of frontier LLMs on natural-language database queries by 17-23 percentage points and renders model choice irrelevant.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The presence of the semantic-layer document accounts for essentially all of the significant variance in accuracy. Adding the 4 KB document improves accuracy by +17 to +23 percentage points across Claude Opus 4.7, Claude Sonnet 4.6, and GPT-5.4. With the document the three models reach 67.7-68.7 percent accuracy and are statistically indistinguishable; without it they reach 45.5-50.5 percent and are also indistinguishable. Every cross-condition comparison is significant at p < 0.01. Explicit business semantics therefore suppress the main class of text-to-SQL errors by changing the task the model faces rather than by increasing model capability.
What carries the argument
The semantic-layer document: a 4 KB hand-authored markdown file that encodes the dataset's measures, conventions, and disambiguation rules and is supplied as additional context alongside the warehouse schema.
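A minimal sketch of how the two conditions of the paired design might be assembled. The paper's actual prompt template is not given here, so the section headings, the closing instruction, and the names `build_prompt` and `paired_prompts` are assumptions for illustration only.

```python
from typing import Dict, Optional


def build_prompt(schema_ddl: str, question: str,
                 semantic_doc: Optional[str] = None) -> str:
    """Assemble one single-shot prompt (hypothetical layout, not the paper's)."""
    parts = ["# Warehouse schema", schema_ddl.strip()]
    if semantic_doc is not None:
        # Only the treatment condition sees the semantic-layer document.
        parts += ["# Semantic layer", semantic_doc.strip()]
    parts += ["# Question", question.strip(),
              "Return a single ClickHouse SQL query."]
    return "\n\n".join(parts)


def paired_prompts(schema_ddl: str, question: str,
                   semantic_doc: str) -> Dict[str, str]:
    """Build both conditions for the same question so they differ only by the document."""
    return {
        "schema_only": build_prompt(schema_ddl, question),
        "schema_plus_doc": build_prompt(schema_ddl, question, semantic_doc),
    }
```

Building both prompts from one function keeps the schema and question text identical across conditions, which is what lets a paired comparison attribute differences to the added document.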
If this is right
- Model choice inside the frontier tier contributes little once business semantics are explicitly supplied.
- The dominant failure mode in current LLM text-to-SQL is missing context rather than insufficient reasoning ability.
- Accuracy and hallucination share a single root cause addressable by the same context change.
- The paired design isolates the effect of the semantic layer from model differences.
Where Pith is reading between the lines
- Semantic layers could reduce reliance on model scaling or fine-tuning for enterprise analytics workloads.
- Hand-authoring the semantic document may become a practical bottleneck when applied to thousands of datasets.
- The approach may transfer to other LLM tasks where domain conventions are not captured by raw data schemas.
- Prompt-length effects could be isolated in follow-up work by using a fixed-length control document that adds no semantics.
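One way the fixed-length control mentioned in the last bullet could be built is sketched below. It assumes the control should keep the semantic document's line structure and markdown markers while replacing all content with neutral filler of the same character length. The function name, the filler vocabulary, and the choice to match characters rather than tokens are assumptions; a token-level match would need the target model's tokenizer.

```python
import itertools
import re

# Neutral filler vocabulary (an assumption; any content-free text would do).
FILLER_WORDS = ["lorem", "ipsum", "dolor", "sit", "amet",
                "consectetur", "adipiscing", "elit"]

# Leading whitespace plus an optional markdown marker (heading, bullet, numbered item).
_PREFIX = re.compile(r"^(\s*(?:#+|[-*]|\d+\.)?\s*)")


def length_matched_filler(semantic_doc: str) -> str:
    """Return a document with the same line layout and length but no semantics."""
    words = itertools.cycle(FILLER_WORDS)
    out = []
    for line in semantic_doc.split("\n"):
        prefix = _PREFIX.match(line).group(1)
        body_len = len(line) - len(prefix)
        body = ""
        while len(body) < body_len:
            body += next(words) + " "
        # Truncate so each line has exactly the original character count.
        out.append(prefix + body[:body_len])
    return "\n".join(out)
```

Running the schema-plus-filler condition alongside the two existing conditions would show how much of the gain, if any, survives when length and formatting are held fixed but the business semantics are removed.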
Load-bearing premise
The 100 natural-language questions and the hand-authored 4 KB semantic document are representative of real-world analytical use cases and do not introduce selection bias or confounding prompt-length effects.
What would settle it
A replication on a different dataset with a larger, independently authored question set that shows no accuracy gain when the same style of semantic document is added.
Original abstract
LLMs deployed for natural-language querying of analytical databases suffer from two intertwined failures - incorrect answers and confident hallucinations - both rooted in the same cause: the model is forced to infer business semantics that the schema does not encode. We test whether supplying those semantics as context closes the gap. We benchmark three frontier LLMs (Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.4) on 100 natural-language questions over the Cleaned Contoso Retail Dataset in ClickHouse, using a paired single-shot protocol. Each model is evaluated twice: once given only the warehouse schema, and once given the schema plus a 4 KB hand-authored markdown document describing the dataset's measures, conventions, and disambiguation rules. Adding the document improves accuracy by +17 to +23 percentage points across all three models. With it, the three models are statistically indistinguishable (67.7-68.7%); without it, they are also indistinguishable (45.5-50.5%). Every cross-cluster comparison is significant at p < 0.01. The presence of the semantic-layer document accounts for essentially all of the significant variance; model choice within tier does not. We interpret this as a structural result: explicit business semantics suppress the dominant class of text-to-SQL errors not by making the model more capable, but by changing what the model is being asked to do.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a paired benchmark of three frontier LLMs (Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.4) on 100 natural-language questions over the Cleaned Contoso Retail Dataset in ClickHouse. Each model is tested twice in a single-shot protocol: once with only the warehouse schema and once with the schema plus a 4 KB hand-authored markdown semantic-layer document describing measures, conventions, and disambiguation rules. The key results are accuracy gains of +17 to +23 percentage points with the document, statistical indistinguishability of the three models within each condition (67.7-68.7% with the document, 45.5-50.5% without), and p < 0.01 significance for all cross-cluster comparisons. The paper concludes that the semantic document accounts for essentially all significant variance while model choice within tier does not, and interprets this as a structural effect of changing the task rather than enhancing model capability.
Significance. If the result holds, the work would be significant for LLM-powered data analytics: it provides empirical evidence that explicit business semantics can close a large fraction of the accuracy/hallucination gap in text-to-SQL, potentially outweighing incremental model improvements. The paired single-shot protocol and reported statistical tests offer a clear, falsifiable measurement framework that could guide future semantic-layer research. Credit is due for the controlled design that isolates the document effect and for the direct variance attribution claim.
Major comments (3)
- [Abstract / Experimental Protocol] The description of the 100 natural-language questions provides no information on selection process, sourcing (e.g., independent generation, query logs, or same-team authorship), diversity metrics, or blinding relative to the 4 KB semantic document. This is load-bearing for the central claim that the document accounts for all significant variance, because selection bias toward ambiguities resolved by the document could produce the +17–23 pp lift and the indistinguishability results without supporting the broader structural interpretation.
- [Methods (paired single-shot protocol)] No details are given on how answer correctness was judged (automated SQL execution against ground truth, human evaluation criteria, or exact-match rules). Without this, the reported accuracy percentages, p < 0.01 significance levels, and within-cluster indistinguishability cannot be independently verified or reproduced.
- [Results] The manuscript does not discuss or control for prompt-length differences between the schema-only and schema-plus-document conditions. The added ~4 KB document could affect model behavior through length or formatting alone, confounding the attribution of all variance to semantic content rather than surface features.
Minor comments (2)
- [Abstract] The accuracy figures are given only as ranges (+17 to +23 pp; 67.7-68.7%; 45.5-50.5%) without per-model breakdowns or a summary table; adding exact per-model, per-condition accuracies would improve readability and allow direct inspection of the claimed indistinguishability.
- [Abstract] The statistical test underlying the p < 0.01 claims and the exact p-values for the within-cluster comparisons are not stated; specifying the test (e.g., McNemar, chi-square) would strengthen the results section.
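The referee names McNemar's test as one candidate for the paired comparisons. The sketch below shows an exact two-sided version computed from per-question correctness flags, assuming the same questions are scored in both conditions. The function name is hypothetical and any data passed to it here would be synthetic; the paper does not state which test it used.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(baseline: Sequence[bool],
                  treated: Sequence[bool]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired correct/incorrect outcomes.

    Returns (b, c, p_value), where b counts questions correct only in the
    baseline condition and c counts questions correct only in the treated one.
    """
    if len(baseline) != len(treated):
        raise ValueError("paired samples must have equal length")
    b = sum(1 for x, y in zip(baseline, treated) if x and not y)
    c = sum(1 for x, y in zip(baseline, treated) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

Only the discordant pairs enter the statistic, so questions that both conditions get right (or both get wrong) do not affect the p-value. This is why reporting b and c alongside the accuracies would let readers check the significance claims directly.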
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which highlight important aspects of clarity, reproducibility, and potential confounds in our experimental design. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.
- Referee: [Abstract / Experimental Protocol] The description of the 100 natural-language questions provides no information on selection process, sourcing (e.g., independent generation, query logs, or same-team authorship), diversity metrics, or blinding relative to the 4 KB semantic document. This is load-bearing for the central claim that the document accounts for all significant variance, because selection bias toward ambiguities resolved by the document could produce the +17–23 pp lift and the indistinguishability results without supporting the broader structural interpretation.
  Authors: We agree that the manuscript would benefit from greater transparency on question selection to support the variance attribution claim. The questions were curated by the authors to represent typical analytical workloads on the Contoso Retail dataset, with deliberate inclusion of queries involving measure definitions, temporal conventions, and disambiguation that the schema alone cannot resolve. In the revised manuscript, we will expand the Experimental Protocol section to describe the sourcing and selection process, provide diversity metrics (e.g., breakdown by query category such as aggregations, filters, and joins), confirm that the question set was finalized prior to semantic document authoring, and commit to releasing the full question list, ground truths, and semantic document as open supplementary material. This will allow independent evaluation of any selection effects while preserving the paired design's ability to isolate the document's contribution.
  Revision: yes
- Referee: [Methods (paired single-shot protocol)] No details are given on how answer correctness was judged (automated SQL execution against ground truth, human evaluation criteria, or exact-match rules). Without this, the reported accuracy percentages, p < 0.01 significance levels, and within-cluster indistinguishability cannot be independently verified or reproduced.
  Authors: We acknowledge this as a critical omission for reproducibility. Answer correctness was determined via automated execution: each generated SQL query was run against the ClickHouse instance, and the result set was compared to a pre-defined ground-truth result for that question using exact row and column matching (with standard floating-point tolerance for numeric values). We will add a dedicated paragraph in the Methods section detailing this evaluation protocol, including the comparison logic, any post-processing, and reference to the verification code. This addition will enable full reproduction of the accuracy figures, p-values, and model indistinguishability results.
  Revision: yes
- Referee: [Results] The manuscript does not discuss or control for prompt-length differences between the schema-only and schema-plus-document conditions. The added ~4 KB document could affect model behavior through length or formatting alone, confounding the attribution of all variance to semantic content rather than surface features.
  Authors: This is a legitimate potential confound that the current manuscript does not address. The ~4 KB semantic document necessarily increases prompt length and alters formatting, which could influence model behavior independently of semantic content. In the revision, we will add explicit discussion of this issue in the Results and Limitations sections, including approximate token counts for both conditions and a qualification that the observed gains are consistent with a semantic effect but cannot be isolated from length without additional controls (such as a non-semantic length-matched filler document). We will also note this as an avenue for future work. The core empirical results and statistical tests remain unchanged, but the interpretation will be more precisely bounded.
  Revision: partial
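The evaluation protocol described in the second response (execute the generated SQL, then compare the result set to ground truth with exact row and column matching and a floating-point tolerance) could look roughly like the sketch below. The function names, the tolerance values, and the decision to treat row order as significant are assumptions; the rebuttal does not specify them, and the authors' verification code is not reproduced here.

```python
import math
from typing import Any, Sequence

Row = Sequence[Any]


def _cell_equal(a: Any, b: Any, rel_tol: float, abs_tol: float) -> bool:
    """Compare one cell: tolerant for numbers, exact for everything else."""
    if isinstance(a, bool) or isinstance(b, bool):
        return a == b  # keep booleans out of the numeric branch
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        return math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
    return a == b


def results_match(got: Sequence[Row], expected: Sequence[Row],
                  rel_tol: float = 1e-6, abs_tol: float = 1e-9) -> bool:
    """True if both result sets have the same shape and matching cells, in order."""
    if len(got) != len(expected):
        return False
    for got_row, exp_row in zip(got, expected):
        if len(got_row) != len(exp_row):
            return False
        if not all(_cell_equal(g, e, rel_tol, abs_tol)
                   for g, e in zip(got_row, exp_row)):
            return False
    return True
```

Under this sketch a query that returns the right rows in a different order counts as wrong. Whether the paper's scoring is order-sensitive is one of the details the promised Methods paragraph would need to state.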
Circularity Check
No circularity: direct empirical measurements from controlled paired benchmark with no derivations or fitted quantities
Full rationale
The paper reports observed accuracy rates from a paired single-shot evaluation of three LLMs on 100 fixed natural-language questions, once with schema only and once with schema plus a hand-authored 4 KB markdown document. No equations, first-principles derivations, parameter fitting, or predictions appear; the central claim (semantic document accounts for essentially all variance) is an interpretation of the measured deltas (+17 to +23 pp) and statistical indistinguishability tests, not a quantity defined in terms of itself. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the protocol or results. The experiment is self-contained against external benchmarks because accuracy is measured directly against ground-truth answers rather than inferred from model internals or prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 100 natural-language questions and the hand-authored semantic document are representative of real-world analytical queries and business semantics.