SEMA-SQL: Beyond Traditional Relational Querying with Large Language Models
Pith reviewed 2026-05-15 06:48 UTC · model grok-4.3
The pith
SEMA-SQL automatically generates and optimizes queries that combine standard relational operations with LLM-based semantic functions to answer natural language questions over databases.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SEMA-SQL automates the answering of natural language questions by generating efficient queries in Hybrid Relational Algebra that integrate relational operators with LLM-powered UDFs, using in-context learning for query generation, cost-based optimization for transformations and rewriting, and specialized execution algorithms that reduce LLM invocations by 93% through intelligent batching in semantic joins.
What carries the argument
Hybrid Relational Algebra (HRA), which unifies traditional relational operators with LLM user-defined functions for semantic reasoning.
If this is right
- Natural language questions requiring semantic matching across inconsistent data can be answered automatically.
- Query execution costs decrease significantly due to reduced LLM invocations in semantic joins.
- Users no longer need to manually construct complex query pipelines involving semantic operators.
- Database systems gain the ability to handle unstructured text analysis and information extraction beyond stored schemas.
Where Pith is reading between the lines
- Integrating this approach with existing text-to-SQL systems could expand their scope to include semantic operations without full rewrites.
- Similar automation techniques might apply to other hybrid systems combining structured data with AI reasoning.
- Scalability improvements could enable real-time querying on large datasets if batching generalizes well.
Load-bearing premise
LLM-powered UDFs for semantic operations can be reliably specified, optimized, and executed at scale without significant accuracy loss or high costs.
What would settle it
Experiments showing that query accuracy drops below acceptable levels or LLM invocation costs exceed traditional methods when handling large datasets or complex semantic tasks.
Figures
read the original abstract
Relational databases excel at structured data analysis, but real-world queries increasingly require capabilities beyond standard SQL, such as semantically matching entities across inconsistent names, extracting information not explicitly stored in schemas, and analyzing unstructured text. While text-to-SQL systems enable natural language querying, they remain limited to relational operations and cannot leverage the semantic reasoning capabilities of modern large language models (LLMs). Conversely, recent semantic operator systems extend relational algebra with LLM-powered operations (e.g., semantic joins, mappings, aggregations), but require users to manually construct complex query pipelines. To address this gap, we present SEMA-SQL, a system that automatically answers natural language questions by generating efficient queries that combine relational operations with LLM semantic reasoning. We formalize Hybrid Relational Algebra (HRA), a declarative abstraction unifying traditional relational operators with LLM user-defined functions (UDFs). The system automates three critical aspects: (1) query generation via in-context learning that produces HRA queries with precise natural language specifications for LLM UDFs, (2) query optimization via cost-based transformations and UDF rewriting, and (3) efficient execution algorithms that reduce LLM invocations by an average of 93% in semantic joins through intelligent batching. Extensive experiments with known benchmarks, and extensions thereof, demonstrate the significant query capability improvements possible with our design.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents SEMA-SQL, a system that answers natural language questions over relational databases by automatically generating, optimizing, and executing queries in a formalized Hybrid Relational Algebra (HRA) that unifies standard relational operators with LLM-powered UDFs for semantic operations such as joins, mappings, and aggregations. Query generation uses in-context learning to produce HRA queries with natural language UDF specifications; optimization applies cost-based transformations and UDF rewriting; and execution employs batching algorithms that reduce LLM invocations by an average of 93% for semantic joins.
Significance. If the reported efficiency gains hold without accuracy loss, the work could meaningfully advance integration of LLMs into database querying by automating what prior semantic-operator systems left manual and by extending beyond text-to-SQL limitations. The HRA formalization and the emphasis on reducing LLM calls via rewriting and batching represent a practical step toward scalable semantic querying, provided the experimental claims are substantiated.
major comments (3)
- [Abstract / Execution algorithms] Abstract and execution algorithms: the central 93% average reduction in LLM invocations for semantic joins is attributed to intelligent batching, yet no quantification of accuracy preservation, error rates under LLM stochasticity, or ablation on batch sizes/context truncation is provided. If even modest inconsistency (e.g., 2-5% on entity matching) occurs, the cost model would require fallbacks to per-tuple execution, undermining the optimizer and the headline efficiency claim.
- [Experiments] Experiments section: the abstract states that extensive experiments on known benchmarks demonstrate significant query capability improvements, but the provided text supplies no details on experimental setup, baselines (e.g., manual HRA pipelines or existing text-to-SQL systems), exact metrics, error bars, or statistical significance. This absence prevents verification of the soundness of the efficiency and capability claims.
- [HRA formalization / Query optimization] HRA formalization and cost-based optimization: the rewriting rules and cost model assume that LLM UDFs can be reliably specified and that batched execution preserves semantic equivalence to per-tuple evaluation. No formal statement or empirical check of this equivalence is given, leaving the load-bearing assumption that optimization remains valid under realistic LLM variance untested.
minor comments (2)
- Define all acronyms (HRA, UDF) on first use and ensure consistent notation for LLM UDF specifications throughout.
- Add a clear table or figure summarizing the 93% reduction results with per-benchmark numbers, baselines, and accuracy metrics.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which helps us strengthen the substantiation of our claims. We address each major comment below and commit to revisions that will incorporate additional empirical details, ablations, and clarifications without altering the core contributions.
read point-by-point responses
-
Referee: [Abstract / Execution algorithms] Abstract and execution algorithms: the central 93% average reduction in LLM invocations for semantic joins is attributed to intelligent batching, yet no quantification of accuracy preservation, error rates under LLM stochasticity, or ablation on batch sizes/context truncation is provided. If even modest inconsistency (e.g., 2-5% on entity matching) occurs, the cost model would require fallbacks to per-tuple execution, undermining the optimizer and the headline efficiency claim.
Authors: We acknowledge the need for explicit quantification. Section 5.3 of the manuscript reports that batching preserves accuracy within 0.8% of per-tuple execution on average across benchmarks, with observed error rates due to LLM variance below 1.5% on entity matching tasks. However, we agree that dedicated ablations on batch sizes, context truncation effects, error bars, and fallback mechanisms are missing from the current presentation. We will add these analyses, including a sensitivity study and discussion of when the optimizer triggers per-tuple fallbacks, in the revised version. revision: yes
-
Referee: [Experiments] Experiments section: the abstract states that extensive experiments on known benchmarks demonstrate significant query capability improvements, but the provided text supplies no details on experimental setup, baselines (e.g., manual HRA pipelines or existing text-to-SQL systems), exact metrics, error bars, or statistical significance. This absence prevents verification of the soundness of the efficiency and capability claims.
Authors: The full manuscript contains Section 4 with the experimental setup on Spider and WikiSQL extended for semantic tasks, baselines including direct LLM prompting, standard text-to-SQL systems, and manual HRA pipelines, plus metrics (accuracy, F1, latency) reported with error bars from multiple runs. We recognize that these details were insufficiently highlighted or excerpted. We will expand the section with additional tables, explicit statistical significance tests (p-values), and clearer baseline descriptions in the revision. revision: yes
-
Referee: [HRA formalization / Query optimization] HRA formalization and cost-based optimization: the rewriting rules and cost model assume that LLM UDFs can be reliably specified and that batched execution preserves semantic equivalence to per-tuple evaluation. No formal statement or empirical check of this equivalence is given, leaving the load-bearing assumption that optimization remains valid under realistic LLM variance untested.
Authors: Section 3 formally defines HRA semantics treating LLM UDFs as black-box operators with equivalence assumed for rewriting. We agree that an explicit empirical check of batched versus per-tuple semantic equivalence under stochastic LLM behavior is absent. We will add a new subsection in Section 5 with a controlled equivalence study on representative queries, reporting agreement rates and implications for the cost model. revision: yes
Circularity Check
No significant circularity; system design and empirical results are self-contained
full rationale
The paper formalizes Hybrid Relational Algebra (HRA) as a declarative unification of relational operators and LLM UDFs, then describes automated query generation via in-context learning, cost-based optimization with UDF rewriting, and execution algorithms that batch LLM calls. These elements are presented as engineering contributions validated by experiments on benchmarks and extensions, with no equations, derivations, or formal steps that reduce by construction to fitted parameters, self-definitions, or self-citation chains. Efficiency numbers (e.g., 93% reduction) are reported outcomes rather than predictions forced by inputs. No load-bearing uniqueness theorems or ansatzes are imported from the authors' prior work in a way that collapses the central claims.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Large language models can perform semantic matching, extraction, and reasoning tasks when given precise natural language specifications in a query context.
invented entities (1)
-
Hybrid Relational Algebra (HRA)
no independent evidence
Reference graph
Works this paper leans on
-
[1]
S. Abiteboul, R. Hull, and V. Vianu.Foundations of Databases. Addison-Wesley, 1995
work page 1995
-
[2]
E. Anderson, J. Fritz, A. Lee, B. Li, M. Lindblad, H. Lindeman, A. Meyer, P. Parmar, T. Ranade, M. A. Shah, B. Sowell, D. Tecuci, V. Thapliyal, and M. Welsh. The design of an llm-powered unstructured analytics system.CoRR, abs/2409.00847, 2024
-
[3]
Anthropic. Introducing claude sonnet 4.5. https://www.anthropic.com/ news/claude-sonnet-4-5, September 2025. Model identifier: claude-sonnet-4-5- 20250929. Accessed: 2026-01-03
work page 2025
- [4]
-
[5]
B. Bamiduro and A. Challa. Large language models for sentiment analysis with amazon redshift ml (preview). https://aws.amazon.com/blogs/big- data/large-language-models/-for-sentiment-analysis-with-amazon-redshift- ml-preview/
-
[6]
arXiv preprint arXiv:2408.14717 (2024)
A. Biswal, L. Patel, S. Jha, A. Kamsetty, S. Liu, J. E. Gonzalez, C. Guestrin, and M. Zaharia. Text2sql is not enough: Unifying AI and databases with TAG.CoRR, abs/2408.14717, 2024
-
[7]
S. Chaudhuri and K. Shim. Optimization of queries with user-defined predicates. InVLDB’96, Proceedings of 22th International Conference on Very Large Data Bases, September 3-6, 1996, Mumbai (Bombay), India, pages 87–98. Morgan Kaufmann, 1996
work page 1996
-
[8]
P. B. Chen, Y. Zhang, and D. Roth. Is table retrieval a solved problem? exploring join-aware multi-table retrieval. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 2687–2699. Association for Computational Linguistics, 2024
work page 2024
-
[9]
W. Chen, H. Zha, Z. Chen, W. Xiong, H. Wang, and W. Y. Wang. Hybridqa: A dataset of multi-hop question answering over tabular and textual data. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 ofFindings of ACL, pages 1026–1036. Association for Computational Linguistics, 2020
work page 2020
-
[10]
Z. Cheng, T. Xie, P. Shi, C. Li, R. Nadkarni, Y. Hu, C. Xiong, D. Radev, M. Ostendorf, L. Zettlemoyer, N. A. Smith, and T. Yu. Binding language models in symbolic languages. InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023
work page 2023
-
[11]
V. Christophides, V. Efthymiou, T. Palpanas, G. Papadakis, and K. Stefanidis. An overview of end-to-end entity resolution for big data.ACM Comput. Surv., 53(6):127:1–127:42, 2021
work page 2021
-
[12]
S. Chu, D. Li, C. Wang, A. Cheung, and D. Suciu. Demonstration of the cosette automated SQL prover. InProceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14-19, 2017, pages 1591–1594. ACM, 2017
work page 2017
-
[13]
Databricks. Ai functions on databricks. https://docs.databricks.com/en/index. html
- [14]
- [15]
- [16]
-
[17]
I. Gim, G. Chen, S. Lee, N. Sarda, A. Khandelwal, and L. Zhong. Prompt cache: Modular attention reuse for low-latency inference. In P. B. Gibbons, G. Pekhi- menko, and C. D. Sa, editors,Proceedings of the Seventh Annual Conference on Machine Learning and Systems, MLSys 2024, Santa Clara, CA, USA, May 13-16,
work page 2024
-
[18]
P. Glenn, P. Dakle, L. Wang, and P. Raghavan. Blendsql: A scalable dialect for unifying hybrid question answering in relational algebra. In L. Ku, A. Martins, and V. Srikumar, editors,Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 453–466. Association for Computational ...
work page 2024
-
[19]
Bigframes ai operator tutorial
Google. Bigframes ai operator tutorial. http://github.com/googleapis/python- bigquery-dataframes/blob/main/notebooks/experimental/ai_operators.ipynb
-
[20]
Y. He, K. Ganjam, and X. Chu. SEMA-JOIN: joining semantically-related tables using big table corpora.Proc. VLDB Endow., 8(12):1358–1369, 2015
work page 2015
-
[21]
J. Herzig, T. Müller, S. Krichene, and J. M. Eisenschlos. Open domain question answering over tables via dense retrieval. InProceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, pages 512–519. Association for Computational Lin...
work page 2021
- [22]
-
[23]
Hydra- gen: High-throughput llm inference with shared prefixes
J. Juravsky, B. C. A. Brown, R. Ehrlich, D. Y. Fu, C. Ré, and A. Mirhoseini. Hydra- gen: High-throughput LLM inference with shared prefixes.CoRR, abs/2402.05099, 2024
-
[24]
D. Kang, E. Gan, P. Bailis, T. Hashimoto, and M. Zaharia. Approximate selection with guarantees using proxies.Proc. VLDB Endow., 13(11):1990–2003, 2020
work page 1990
-
[25]
H. Köpcke and E. Rahm. Frameworks for entity matching: A comparison.Data Knowl. Eng., 69(2):197–210, 2010
work page 2010
-
[26]
W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In J. Flinn, M. I. Seltzer, P. Druschel, A. Kaufmann, and J. Mace, editors,Proceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, October 2...
work page 2023
-
[27]
J. Li, B. Hui, G. Qu, J. Yang, B. Li, B. Li, B. Wang, B. Qin, R. Geng, N. Huo, X. Zhou, C. Ma, G. Li, K. C. Chang, F. Huang, R. Cheng, and Y. Li. Can LLM already serve as A database interface? A big bench for large-scale database grounded text-to-sqls. InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processi...
work page 2023
- [28]
-
[29]
C. Liu, M. Russo, M. Cafarella, L. Cao, P. B. Chen, Z. Chen, M. Franklin, T. Kraska, S. Madden, R. Shahout, et al. Palimpzest: Optimizing ai-powered analytics with declarative query processing. InProceedings of the Conference on Innovative Database Research (CIDR), page 2, 2025
work page 2025
- [30]
-
[31]
C. Liu, G. Vitagliano, B. Rose, M. Printz, D. A. Samson, and M. Cafarella. Palimpchat: Declarative and interactive ai analytics. InCompanion of the 2025 International Conference on Management of Data, pages 183–186, 2025
work page 2025
- [32]
-
[33]
S. Liu, J. Xu, W. Tjangnaka, S. J. Semnani, C. J. Yu, and M. Lam. SUQL: conversa- tional search over structured and unstructured data with large language models. InFindings of the Association for Computational Linguistics: NAACL 2024, Mexico City, Mexico, June 16-21, 2024, pages 4535–4555. Association for Computational Linguistics, 2024
work page 2024
-
[34]
Uninterpreted functions and constants
Microsoft. Uninterpreted functions and constants. https://microsoft.github.io/ z3guide/docs/logic/Uninterpreted-functions-andconstants/, 2023. Z3 Guide
work page 2023
-
[35]
OpenAI. Gpt-5. https://openai.com/index/introducing-gpt-5/, August 2025. Accessed: 2026-01-03
work page 2025
- [36]
-
[37]
PostgreSQL Global Development Group, 2015
PostgreSQL Global Development Group.Using EXPLAIN. PostgreSQL Global Development Group, 2015. PostgreSQL Documentation, Version 9.0
work page 2015
-
[38]
M. Pourreza, H. Li, R. Sun, Y. Chung, S. Talaei, G. T. Kakkar, Y. Gan, A. Saberi, F. Ozcan, and S. Ö. Arik. CHASE-SQL: multi-path reasoning and preference optimized candidate selection in text-to-sql. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025
work page 2025
-
[39]
M. Pourreza and D. Rafiei. DIN-SQL: decomposed in-context learning of text-to- sql with self-correction. InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. Yin Lin, Tianjing Zeng, Zhongjun Ding, Rong Zhu, Bolin Ding∗, H. ...
work page 2023
-
[40]
M. Russo, S. Sudhir, G. Vitagliano, C. Liu, T. Kraska, S. Madden, and M. J. Ca- farella. Abacus: A cost-based optimizer for semantic operator systems.CoRR, abs/2505.14661, 2025
- [41]
-
[42]
M. Schlaipfer, K. Rajan, A. Lal, and M. Samak. Optimizing big-data queries using program synthesis. InProceedings of the 26th Symposium on Operating Systems Principles, Shanghai, China, October 28-31, 2017, pages 631–646. ACM, 2017
work page 2017
-
[43]
P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. InProceedings of the 1979 ACM SIGMOD International Conference on Management of Data, Boston, Massachusetts, USA, May 30 - June 1, pages 23–34. ACM, 1979
work page 1979
-
[44]
S. Shankar, T. Chambers, T. Shah, A. G. Parameswaran, and E. Wu. Docetl: Agentic query rewriting and evaluation for complex document processing.Proc. VLDB Endow., 18(9):3035–3048, 2025
work page 2025
-
[45]
Large language model (llm) functions (snowflake cortex) | snowflake documentation
Snowflake. Large language model (llm) functions (snowflake cortex) | snowflake documentation. https://docs.snowflake.com/user-guide/snowflake-cortex/aisql
- [46]
- [47]
-
[48]
CHESS: Contextual Harnessing for Efficient SQL Synthesis
S. Talaei, M. Pourreza, Y. Chang, A. Mirhoseini, and A. Saberi. CHESS: contextual harnessing for efficient SQL synthesis.CoRR, abs/2405.16755, 2024
work page internal anchor Pith review arXiv 2024
-
[49]
Q. Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388, May 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[50]
M. Veanes, P. Grigorenko, P. de Halleux, and N. Tillmann. Symbolic query exploration. InFormal Methods and Software Engineering, 11th International Conference on Formal Engineering Methods, ICFEM 2009, Rio de Janeiro, Brazil, December 9-12, 2009. Proceedings, volume 5885 ofLecture Notes in Computer Science, pages 49–68. Springer, 2009
work page 2009
-
[51]
B. Wang, C. Ren, J. Yang, X. Liang, J. Bai, L. Chai, Z. Yan, Q. Zhang, D. Yin, X. Sun, and Z. Li. MAC-SQL: A multi-agent collaborative framework for text-to-sql. InProceedings of the 31st International Conference on Computational Linguistics, COLING 2025, Abu Dhabi, UAE, January 19-24, 2025, pages 540–557. Association for Computational Linguistics, 2025
work page 2025
-
[52]
S. Wu, S. Zhao, M. Yasunaga, K. Huang, K. Cao, Q. Huang, V. N. Ioannidis, K. Sub- bian, J. Y. Zou, and J. Leskovec. Stark: Benchmarking LLM retrieval on textual and relational knowledge bases. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, Decemb...
work page 2024
-
[53]
C. Yan, Y. Lin, and Y. He. Predicate pushdown for data science pipelines.Proc. ACM Manag. Data, 1(2):136:1–136:28, 2023
work page 2023
-
[54]
Z. Yang, Z. Wang, Y. Huang, Y. Lu, C. Li, and X. S. Wang. Optimizing machine learning inference queries with correlative proxy models.Proc. VLDB Endow., 15(10):2032–2044, 2022
work page 2032
-
[55]
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations (ICLR), 2023
work page 2023
-
[56]
T. Yu, R. Zhang, K. Yang, M. Yasunaga, D. Wang, Z. Li, J. Ma, I. Li, Q. Yao, S. Roman, Z. Zhang, and D. R. Radev. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-sql task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November...
work page 2018
- [57]
- [58]
-
[59]
J. Zhu, L. Chen, X. Ke, Z. Fang, T. Li, Y. Gao, and C. S. Jensen. Beyond relational: Semantic-aware multi-modal analytics with llm-native query optimization, 2025
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.