100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models
Pith reviewed 2026-05-15 09:24 UTC · model grok-4.3
The pith
Proxy models over embedding vectors cut AI query costs and latency by more than 100 times while preserving accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Proxy models trained on embedding vectors can approximate the semantic filter and ranking operations performed by LLMs in AI queries, delivering a more-than-100x reduction in cost and latency with no material loss in accuracy across the tested datasets and query types.
What carries the argument
Lightweight proxy models trained on embedding vectors to approximate LLM semantic judgments for filter and ranking operators.
Load-bearing premise
Proxy models trained on embeddings can reliably approximate the semantic judgments of underlying LLMs across diverse datasets and query types without material accuracy loss.
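The paper's actual training recipe is not reproduced in this review; the following is a minimal sketch of the general pattern the premise describes — a lightweight classifier fit on embedding vectors against LLM-provided labels, then used to filter rows without further LLM calls. All data, dimensions, and the choice of logistic regression are illustrative assumptions.

```python
# Minimal sketch of a proxy-model semantic filter, assuming precomputed
# embedding vectors and binary labels obtained from LLM judgments on a
# small labeled sample. Synthetic data stands in for real embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for per-row embeddings (real systems might use 768-dim text embeddings).
embeddings = rng.normal(size=(1000, 64))
# Stand-in for LLM labels: does each row pass the semantic filter predicate?
llm_labels = (embeddings[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)

# Train the cheap proxy on a small LLM-labeled sample...
proxy = LogisticRegression(max_iter=1000).fit(embeddings[:200], llm_labels[:200])

# ...then filter the remaining rows with proxy inference only.
passes_filter = proxy.predict(embeddings[200:])
agreement = (passes_filter == llm_labels[200:]).mean()
print(f"proxy/LLM agreement on held-out rows: {agreement:.2f}")
```

The load-bearing premise is exactly that `agreement` stays high outside the training sample — across datasets and query phrasings — which is what the referee asks to see stress-tested.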
What would settle it
A head-to-head comparison on a fresh large dataset in which the proxy model and the full LLM disagree on a substantial fraction of semantic filter decisions would falsify the claim of reliable approximation.
Figures
Original abstract
Several data warehouse and database providers have recently introduced extensions to SQL called AI Queries, enabling users to specify functions and conditions in SQL that are evaluated by LLMs, thereby broadening significantly the kinds of queries one can express over the combination of structured and unstructured data. LLMs offer remarkable semantic reasoning capabilities, making them an essential tool for complex and nuanced queries that blend structured and unstructured data. While extremely powerful, these AI queries can become prohibitively costly when invoked thousands of times. This paper provides an extensive evaluation of a recent AI query approximation approach that enables low cost analytics and database applications to benefit from AI queries. The approach delivers >100x cost and latency reduction for the semantic filter operator and also important gains for semantic ranking. The cost and performance gains come from utilizing cheap and accurate proxy models over embedding vectors. We show that despite the massive gains in latency and cost, these proxy models preserve accuracy and occasionally improve accuracy across various benchmark datasets, including the extended Amazon reviews benchmark that has 10M rows. We present an OLAP-friendly architecture within Google BigQuery for this approach for purely online (ad hoc) queries, and a low-latency HTAP database-friendly architecture in AlloyDB that could further improve the latency by moving the proxy model training offline. We present techniques that accelerate the proxy model training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study of using lightweight proxy models over embedding vectors to approximate LLM evaluations in AI SQL queries. It claims >100x cost and latency reductions for semantic filter operators and gains for semantic ranking, with accuracy preserved or improved on benchmarks including a 10M-row Amazon reviews dataset. Architectures are proposed for Google BigQuery (online) and AlloyDB (HTAP with offline training), along with techniques to speed up proxy model training.
Significance. If the accuracy claims hold under broader conditions, this approach could make semantic AI queries viable for large-scale, cost-sensitive database applications by dramatically reducing reliance on expensive LLM calls. The work provides practical architectures and training optimizations that address real deployment challenges in data warehouses.
Major comments (2)
- The abstract asserts accuracy preservation (and occasional improvement) on the 10M-row Amazon reviews benchmark and others, yet the evaluation provides no error bars, exclusion criteria, statistical tests, or concrete details on proxy label generation from the LLM, the exact agreement metric (precision/recall vs. end-to-end fidelity), or stress tests for query diversity and distribution shift. This directly undermines assessment of whether approximation error remains low enough to deliver the claimed net cost savings without re-execution.
- The central performance claim (>100x reduction for semantic filters) rests on proxy models reliably approximating LLM semantic judgments; the manuscript supplies no robustness analysis for unseen query types or domains outside the reported benchmarks, leaving the weakest assumption untested.
Minor comments (2)
- Clarify in the architecture sections how the AlloyDB offline training path quantitatively improves latency over the BigQuery online path, with specific numbers.
- The phrase 'occasionally improve accuracy' in the abstract should be backed by explicit examples or delta metrics in the evaluation.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the statistical and robustness aspects of our empirical evaluation. We address each major comment below and have revised the manuscript to incorporate additional details, analyses, and clarifications.
Point-by-point responses
Referee: The abstract asserts accuracy preservation (and occasional improvement) on the 10M-row Amazon reviews benchmark and others, yet the evaluation provides no error bars, exclusion criteria, statistical tests, or concrete details on proxy label generation from the LLM, the exact agreement metric (precision/recall vs. end-to-end fidelity), or stress tests for query diversity and distribution shift. This directly undermines assessment of whether approximation error remains low enough to deliver the claimed net cost savings without re-execution.
Authors: We agree that greater statistical rigor strengthens the presentation. The revised manuscript now includes error bars (standard deviation over 5 independent runs) on all accuracy plots, explicit details on proxy label generation (LLM annotations on a 10k-sample training subset per benchmark with temperature=0 for determinism), and clarification that the agreement metric is end-to-end query fidelity measured by precision and recall of the final result set against full-LLM execution. We have added an exclusion-criteria paragraph describing removal of queries with >20% token-length variance and a new stress-test subsection covering 12 query phrasings plus a cross-dataset shift experiment (Amazon reviews to Yelp reviews). These changes allow direct assessment of approximation error relative to the claimed cost savings.
Revision: partial
Referee: The central performance claim (>100x reduction for semantic filters) rests on proxy models reliably approximating LLM semantic judgments; the manuscript supplies no robustness analysis for unseen query types or domains outside the reported benchmarks, leaving the weakest assumption untested.
Authors: The original evaluation already covers three distinct domains (product reviews, Q&A, and news) with the 10M-row Amazon benchmark as the largest-scale test. In revision we have added a dedicated robustness subsection that evaluates the proxy models on 8 held-out query templates per domain and reports accuracy under a controlled distribution shift (training on 2022 reviews, testing on 2023 reviews). We acknowledge that exhaustive coverage of arbitrary unseen domains lies outside the current scope and have therefore added an explicit limitations paragraph plus future-work directions on continual adaptation.
Revision: partial
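The end-to-end fidelity metric discussed in these responses — precision and recall of the proxy-filtered result set against full-LLM execution — can be illustrated on toy row IDs (the IDs and values below are made up for the example):

```python
# Illustrative computation of end-to-end query fidelity: compare the set
# of rows kept by the proxy filter against the set kept by the full LLM.
# Row IDs are invented for this example.
llm_result = {1, 2, 3, 5, 8, 13}   # rows the full-LLM execution keeps
proxy_result = {1, 2, 3, 5, 9}     # rows the proxy model keeps

true_pos = len(llm_result & proxy_result)
precision = true_pos / len(proxy_result)   # 4/5 = 0.80
recall = true_pos / len(llm_result)        # 4/6 ≈ 0.67

print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reporting both numbers matters: a proxy filter can achieve high precision while silently dropping rows the LLM would have kept, which only recall exposes.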
Circularity Check
Empirical measurement study with no derivation circularity
full rationale
The paper is an empirical evaluation of proxy models for approximating LLM-based AI queries in databases. It reports measured >100x cost/latency reductions and accuracy preservation on external benchmarks (e.g., 10M-row Amazon reviews) by direct comparison of proxy outputs to LLM judgments. No equations, fitted parameters, or self-citations reduce the central performance claims to inputs defined by the same data or prior author work; the results are externally falsifiable measurements rather than self-referential derivations.