Semantic Layers for Reliable LLM-Powered Data Analytics: A Paired Benchmark of Accuracy and Hallucination Across Three Frontier Models

Ivan Fokeev; Michael Rumiantsau

arxiv: 2604.25149 · v1 · submitted 2026-04-28 · 💻 cs.AI

Pith reviewed 2026-05-07 16:37 UTC · model grok-4.3

classification 💻 cs.AI

keywords semantic layers · LLM text-to-SQL · hallucination · data analytics · benchmark · business semantics · context engineering · natural language queries

The pith

Supplying a compact semantic-layer document lifts the accuracy of frontier LLMs on natural-language database queries by 17-23 percentage points and leaves the three tested models statistically indistinguishable from one another.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether LLMs fail at database analytics because they must guess business meanings absent from the raw schema. It runs a paired benchmark of three frontier models on 100 questions over a retail dataset, once with only the schema and once with the schema plus a 4 KB hand-written document that spells out measures, conventions, and disambiguation rules. Adding the document raises accuracy from the 45-50 percent range to the 67-69 percent range for every model; within each condition the models are statistically indistinguishable. The result shows that the dominant errors stem from missing semantics rather than from model capacity, so the fix is to change the input rather than to swap the model.

Core claim

The presence of the semantic-layer document accounts for essentially all of the significant variance in accuracy. Adding the 4 KB document improves accuracy by +17 to +23 percentage points across Claude Opus 4.7, Claude Sonnet 4.6, and GPT-5.4. With the document the three models reach 67.7-68.7 percent accuracy and are statistically indistinguishable; without it they reach 45.5-50.5 percent and are also indistinguishable. Every cross-condition comparison is significant at p < 0.01. Explicit business semantics therefore suppress the main class of text-to-SQL errors by changing the task the model faces rather than by increasing model capability.

What carries the argument

The semantic-layer document: a 4 KB hand-authored markdown file that encodes the dataset's measures, conventions, and disambiguation rules and is supplied as additional context alongside the warehouse schema.
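
The two conditions of the paired protocol can be sketched as a prompt builder that differs only in whether the semantic-layer document is included. This is a minimal sketch, not the authors' harness: the function names, section headings, and instruction wording are all assumptions, since the source does not reproduce the prompts.

```python
from typing import Dict, Optional


def build_prompt(schema: str, question: str, semantic_doc: Optional[str] = None) -> str:
    """Assemble a single-shot text-to-SQL prompt.

    The headings and instruction text are hypothetical; only the idea of
    "schema alone" versus "schema plus semantic document" comes from the source.
    """
    parts = [
        "Write one ClickHouse SQL query that answers the question.",
        "## Warehouse schema",
        schema.strip(),
    ]
    if semantic_doc is not None:
        # The only difference between the two conditions is this block.
        parts += ["## Semantic layer", semantic_doc.strip()]
    parts += ["## Question", question.strip()]
    return "\n\n".join(parts)


def paired_prompts(schema: str, question: str, semantic_doc: str) -> Dict[str, str]:
    """Return both prompts for one question so each model sees the same pair."""
    return {
        "schema_only": build_prompt(schema, question),
        "schema_plus_doc": build_prompt(schema, question, semantic_doc),
    }
```

Building both prompts from one function keeps everything except the document constant, which is what lets a paired comparison attribute the difference to the document.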

If this is right

Model choice inside the frontier tier contributes little once business semantics are explicitly supplied.
The dominant failure mode in current LLM text-to-SQL is missing context rather than insufficient reasoning ability.
Accuracy and hallucination share a single root cause addressable by the same context change.
The paired design isolates the effect of the semantic layer from model differences.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Semantic layers could reduce reliance on model scaling or fine-tuning for enterprise analytics workloads.
Hand-authoring the semantic document may become a practical bottleneck when applied to thousands of datasets.
The approach may transfer to other LLM tasks where domain conventions are not captured by raw data schemas.
Prompt-length effects could be isolated in follow-up work by using a fixed-length control document that adds no semantics.
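
The length-control idea in the last item could be sketched as a filler generator that matches the semantic document's size while carrying no dataset semantics. This is an illustration under stated assumptions: matching on UTF-8 byte length is a simplification (token counts are model-specific), and the function name and default filler sentence are hypothetical.

```python
def length_matched_filler(
    semantic_doc: str,
    filler: str = "This line is neutral padding and says nothing about the data. ",
) -> str:
    """Return ASCII filler text with the same UTF-8 byte length as semantic_doc.

    Assumption: byte length is the quantity being controlled. A tokenizer-based
    match would need the tokenizer of each model under test.
    """
    if not filler:
        raise ValueError("filler must be non-empty")
    # ASCII-only filler means one character is one byte, so slicing is exact.
    filler.encode("ascii")  # raises UnicodeEncodeError for non-ASCII filler
    target = len(semantic_doc.encode("utf-8"))
    repetitions = target // len(filler) + 1
    return (filler * repetitions)[:target]
```

Running the benchmark a third time with this filler in place of the semantic document would show how much of the gain, if any, comes from prompt length alone.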

Load-bearing premise

The 100 natural-language questions and the hand-authored 4 KB semantic document are representative of real-world analytical use cases and do not introduce selection bias or confounding prompt-length effects.

What would settle it

A replication on a different dataset with a larger, independently authored question set that shows no accuracy gain when the same style of semantic document is added.

Original abstract

LLMs deployed for natural-language querying of analytical databases suffer from two intertwined failures - incorrect answers and confident hallucinations - both rooted in the same cause: the model is forced to infer business semantics that the schema does not encode. We test whether supplying those semantics as context closes the gap. We benchmark three frontier LLMs (Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.4) on 100 natural-language questions over the Cleaned Contoso Retail Dataset in ClickHouse, using a paired single-shot protocol. Each model is evaluated twice: once given only the warehouse schema, and once given the schema plus a 4 KB hand-authored markdown document describing the dataset's measures, conventions, and disambiguation rules. Adding the document improves accuracy by +17 to +23 percentage points across all three models. With it, the three models are statistically indistinguishable (67.7-68.7%); without it, they are also indistinguishable (45.5-50.5%). Every cross-cluster comparison is significant at p < 0.01. The presence of the semantic-layer document accounts for essentially all of the significant variance; model choice within tier does not. We interpret this as a structural result: explicit business semantics suppress the dominant class of text-to-SQL errors not by making the model more capable, but by changing what the model is being asked to do.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a paired benchmark of three frontier LLMs (Claude Opus 4.7, Claude Sonnet 4.6, GPT-5.4) on 100 natural-language questions over the Cleaned Contoso Retail Dataset in ClickHouse. Each model is tested twice in a single-shot protocol: once with only the warehouse schema and once with the schema plus a 4 KB hand-authored markdown semantic-layer document describing measures, conventions, and disambiguation rules. Key results are accuracy gains of +17 to +23 percentage points with the document, statistical indistinguishability of the models within each condition (67.7-68.7% with the document, 45.5-50.5% without), and p < 0.01 significance for all cross-cluster comparisons. The paper concludes that the semantic document accounts for essentially all significant variance while model choice within tier does not, interpreting this as a structural effect of changing the task rather than enhancing model capability.

Significance. If the result holds, the work would be significant for LLM-powered data analytics: it provides empirical evidence that explicit business semantics can close a large fraction of the accuracy/hallucination gap in text-to-SQL, potentially outweighing incremental model improvements. The paired single-shot protocol and reported statistical tests offer a clear, falsifiable measurement framework that could guide future semantic-layer research. Credit is due for the controlled design that isolates the document effect and for the direct variance attribution claim.

major comments (3)

[Abstract / Experimental Protocol] The description of the 100 natural-language questions provides no information on selection process, sourcing (e.g., independent generation, query logs, or same-team authorship), diversity metrics, or blinding relative to the 4 KB semantic document. This is load-bearing for the central claim that the document accounts for all significant variance, because selection bias toward ambiguities resolved by the document could produce the +17–23 pp lift and the indistinguishability results without supporting the broader structural interpretation.
[Methods (paired single-shot protocol)] No details are given on how answer correctness was judged (automated SQL execution against ground truth, human evaluation criteria, or exact-match rules). Without this, the reported accuracy percentages, p < 0.01 significance levels, and within-cluster indistinguishability cannot be independently verified or reproduced.
[Results] The manuscript does not discuss or control for prompt-length differences between the schema-only and schema-plus-document conditions. The added ~4 KB document could affect model behavior through length or formatting alone, confounding the attribution of all variance to semantic content rather than surface features.

minor comments (2)

[Abstract] The accuracy figures are given only as ranges (+17 to +23 pp; 67.7-68.7%; 45.5-50.5%) without per-model breakdowns or a summary table; adding exact per-model, per-condition accuracies would improve readability and allow direct inspection of the claimed indistinguishability.
[Abstract] The statistical test underlying the p < 0.01 claims and the exact p-values for the within-cluster comparisons are not stated; specifying the test (e.g., McNemar, chi-square) would strengthen the results section.
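
For paired binary outcomes such as these (the same question answered with and without the document), McNemar's test is one natural choice. The sketch below computes its exact two-sided form from the discordant pairs using only the standard library. The source does not say which test the paper used, so this is an illustration of the referee's suggestion, and the function name is hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(without_doc: Sequence[bool], with_doc: Sequence[bool]) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired correct/incorrect outcomes.

    Returns (b, c, p_value), where b counts questions correct only without
    the document and c counts questions correct only with it.
    """
    if len(without_doc) != len(with_doc):
        raise ValueError("paired outcome lists must have the same length")
    b = sum(1 for x, y in zip(without_doc, with_doc) if x and not y)
    c = sum(1 for x, y in zip(without_doc, with_doc) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    # Under H0 the discordant pairs split as Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2.0 * tail)
```

Only the discordant pairs enter the statistic, which is why per-question outcomes (not just the aggregate accuracies quoted above) are needed to reproduce the reported p-values.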

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which highlight important aspects of clarity, reproducibility, and potential confounds in our experimental design. We address each major comment point by point below, indicating where revisions will be made to strengthen the manuscript.

Referee: [Abstract / Experimental Protocol] The description of the 100 natural-language questions provides no information on selection process, sourcing (e.g., independent generation, query logs, or same-team authorship), diversity metrics, or blinding relative to the 4 KB semantic document. This is load-bearing for the central claim that the document accounts for all significant variance, because selection bias toward ambiguities resolved by the document could produce the +17–23 pp lift and the indistinguishability results without supporting the broader structural interpretation.

Authors: We agree that the manuscript would benefit from greater transparency on question selection to support the variance attribution claim. The questions were curated by the authors to represent typical analytical workloads on the Contoso Retail dataset, with deliberate inclusion of queries involving measure definitions, temporal conventions, and disambiguation that the schema alone cannot resolve. In the revised manuscript, we will expand the Experimental Protocol section to describe the sourcing and selection process, provide diversity metrics (e.g., breakdown by query category such as aggregations, filters, and joins), confirm that the question set was finalized prior to semantic document authoring, and commit to releasing the full question list, ground truths, and semantic document as open supplementary material. This will allow independent evaluation of any selection effects while preserving the paired design's ability to isolate the document's contribution.
revision: yes

Referee: [Methods (paired single-shot protocol)] No details are given on how answer correctness was judged (automated SQL execution against ground truth, human evaluation criteria, or exact-match rules). Without this, the reported accuracy percentages, p < 0.01 significance levels, and within-cluster indistinguishability cannot be independently verified or reproduced.

Authors: We acknowledge this as a critical omission for reproducibility. Answer correctness was determined via automated execution: each generated SQL query was run against the ClickHouse instance, and the result set was compared to a pre-defined ground-truth result for that question using exact row and column matching (with standard floating-point tolerance for numeric values). We will add a dedicated paragraph in the Methods section detailing this evaluation protocol, including the comparison logic, any post-processing, and reference to the verification code. This addition will enable full reproduction of the accuracy figures, p-values, and model indistinguishability results.
revision: yes

Referee: [Results] The manuscript does not discuss or control for prompt-length differences between the schema-only and schema-plus-document conditions. The added ~4 KB document could affect model behavior through length or formatting alone, confounding the attribution of all variance to semantic content rather than surface features.

Authors: This is a legitimate potential confound that the current manuscript does not address. The ~4 KB semantic document necessarily increases prompt length and alters formatting, which could influence model behavior independently of semantic content. In the revision, we will add explicit discussion of this issue in the Results and Limitations sections, including approximate token counts for both conditions and a qualification that the observed gains are consistent with a semantic effect but cannot be isolated from length without additional controls (such as a non-semantic length-matched filler document). We will also note this as an avenue for future work. The core empirical results and statistical tests remain unchanged, but the interpretation will be more precisely bounded.
revision: partial
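
The evaluation protocol described in the second response (exact row and column matching with a floating-point tolerance) could look roughly like the following. It is a sketch, not the authors' verification code: the tolerance values are assumptions (the response says only "standard"), rows are compared positionally on the assumption that both result sets arrive in the same order, and the function names are hypothetical.

```python
import math
from typing import Any, Sequence


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def cells_match(got: Any, expected: Any, rel_tol: float = 1e-6, abs_tol: float = 1e-9) -> bool:
    """Compare two cells: tolerant for numbers, exact for everything else."""
    if _is_number(got) and _is_number(expected):
        return math.isclose(got, expected, rel_tol=rel_tol, abs_tol=abs_tol)
    return got == expected


def results_match(got: Sequence[Sequence[Any]], expected: Sequence[Sequence[Any]]) -> bool:
    """True when both result sets have the same shape and matching cells.

    Assumption: row order is meaningful (for example, both queries carry the
    same ORDER BY) or the caller has already sorted both sides.
    """
    if len(got) != len(expected):
        return False
    for got_row, expected_row in zip(got, expected):
        if len(got_row) != len(expected_row):
            return False
        if not all(cells_match(g, e) for g, e in zip(got_row, expected_row)):
            return False
    return True
```

A comparison of this kind scores a query as correct only when the executed result agrees with the ground truth, so two differently written SQL statements that return the same rows both count as correct.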

Circularity Check

0 steps flagged

No circularity: direct empirical measurements from controlled paired benchmark with no derivations or fitted quantities

The paper reports observed accuracy rates from a paired single-shot evaluation of three LLMs on 100 fixed natural-language questions, once with schema only and once with schema plus a hand-authored 4 KB markdown document. No equations, first-principles derivations, parameter fitting, or predictions appear; the central claim (semantic document accounts for essentially all variance) is an interpretation of the measured deltas (+17 to +23 pp) and statistical indistinguishability tests, not a quantity defined in terms of itself. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the protocol or results. The experiment is self-contained: accuracy is measured directly against ground-truth answers rather than inferred from model internals or the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on direct measurement in a controlled paired experiment. No free parameters or invented entities are introduced. The primary assumption is that the chosen dataset, questions, and semantic document are representative of typical LLM analytics tasks.

axioms (1)

Domain assumption: The 100 natural-language questions and the hand-authored semantic document are representative of real-world analytical queries and business semantics.
Generalization of the accuracy improvement and model-independence finding to broader LLM data analytics relies on this representativeness.
