100x Cost & Latency Reduction: Performance Analysis of AI Query Approximation using Lightweight Proxy Models
Pith reviewed 2026-05-15 09:24 UTC · model grok-4.3
The pith
Proxy models over embedding vectors cut AI query costs and latency by more than 100 times while preserving accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Proxy models trained on embedding vectors can approximate the semantic filter and ranking operations performed by LLMs in AI queries, delivering a more-than-100x reduction in cost and latency with no material loss in accuracy across the tested datasets and query types.
What carries the argument
Lightweight proxy models trained on embedding vectors to approximate LLM semantic judgments for filter and ranking operators.
Load-bearing premise
Proxy models trained on embeddings can reliably approximate the semantic judgments of underlying LLMs across diverse datasets and query types without material accuracy loss.
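The paper's actual training recipe is not reproduced in this review; the following is a minimal sketch of the general pattern the premise describes — a lightweight classifier fit on embedding vectors against LLM-provided labels, then used to filter rows without further LLM calls. All data, dimensions, and the choice of logistic regression are illustrative assumptions.

```python
# Minimal sketch of a proxy-model semantic filter, assuming precomputed
# embedding vectors and binary labels obtained from LLM judgments on a
# small labeled sample. Synthetic data stands in for real embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for per-row embeddings (real systems might use 768-dim text embeddings).
embeddings = rng.normal(size=(1000, 64))
# Stand-in for LLM labels: does each row pass the semantic filter predicate?
llm_labels = (embeddings[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)

# Train the cheap proxy on a small LLM-labeled sample...
proxy = LogisticRegression(max_iter=1000).fit(embeddings[:200], llm_labels[:200])

# ...then filter the remaining rows with proxy inference only.
passes_filter = proxy.predict(embeddings[200:])
agreement = (passes_filter == llm_labels[200:]).mean()
print(f"proxy/LLM agreement on held-out rows: {agreement:.2f}")
```

The load-bearing premise is exactly that `agreement` stays high outside the training sample — across datasets and query phrasings — which is what the referee asks to see stress-tested.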
What would settle it
A head-to-head comparison on a fresh large dataset in which the proxy model and the full LLM disagree on a substantial fraction of semantic filter decisions would falsify the claim of reliable approximation.
Figures
Original abstract
Several data warehouse and database providers have recently introduced extensions to SQL called AI Queries, enabling users to specify functions and conditions in SQL that are evaluated by LLMs, thereby broadening significantly the kinds of queries one can express over the combination of structured and unstructured data. LLMs offer remarkable semantic reasoning capabilities, making them an essential tool for complex and nuanced queries that blend structured and unstructured data. While extremely powerful, these AI queries can become prohibitively costly when invoked thousands of times. This paper provides an extensive evaluation of a recent AI query approximation approach that enables low cost analytics and database applications to benefit from AI queries. The approach delivers >100x cost and latency reduction for the semantic filter operator and also important gains for semantic ranking. The cost and performance gains come from utilizing cheap and accurate proxy models over embedding vectors. We show that despite the massive gains in latency and cost, these proxy models preserve accuracy and occasionally improve accuracy across various benchmark datasets, including the extended Amazon reviews benchmark that has 10M rows. We present an OLAP-friendly architecture within Google BigQuery for this approach for purely online (ad hoc) queries, and a low-latency HTAP database-friendly architecture in AlloyDB that could further improve the latency by moving the proxy model training offline. We present techniques that accelerate the proxy model training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study of using lightweight proxy models over embedding vectors to approximate LLM evaluations in AI SQL queries. It claims >100x cost and latency reductions for semantic filter operators and gains for semantic ranking, with accuracy preserved or improved on benchmarks including a 10M-row Amazon reviews dataset. Architectures are proposed for Google BigQuery (online) and AlloyDB (HTAP with offline training), along with techniques to speed up proxy model training.
Significance. If the accuracy claims hold under broader conditions, this approach could make semantic AI queries viable for large-scale, cost-sensitive database applications by dramatically reducing reliance on expensive LLM calls. The work provides practical architectures and training optimizations that address real deployment challenges in data warehouses.
Major comments (2)
- The abstract asserts accuracy preservation (and occasional improvement) on the 10M-row Amazon reviews benchmark and others, yet the evaluation provides no error bars, exclusion criteria, statistical tests, or concrete details on proxy label generation from the LLM, the exact agreement metric (precision/recall vs. end-to-end fidelity), or stress tests for query diversity and distribution shift. This directly undermines assessment of whether approximation error remains low enough to deliver the claimed net cost savings without re-execution.
- The central performance claim (>100x reduction for semantic filters) rests on proxy models reliably approximating LLM semantic judgments; the manuscript supplies no robustness analysis for unseen query types or domains outside the reported benchmarks, leaving the weakest assumption untested.
Minor comments (2)
- Clarify in the architecture sections how the AlloyDB offline training path quantitatively improves latency over the BigQuery online path, with specific numbers.
- The phrase 'occasionally improve accuracy' in the abstract should be backed by explicit examples or delta metrics in the evaluation.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the statistical and robustness aspects of our empirical evaluation. We address each major comment below and have revised the manuscript to incorporate additional details, analyses, and clarifications.
Point-by-point responses
Referee: The abstract asserts accuracy preservation (and occasional improvement) on the 10M-row Amazon reviews benchmark and others, yet the evaluation provides no error bars, exclusion criteria, statistical tests, or concrete details on proxy label generation from the LLM, the exact agreement metric (precision/recall vs. end-to-end fidelity), or stress tests for query diversity and distribution shift. This directly undermines assessment of whether approximation error remains low enough to deliver the claimed net cost savings without re-execution.
Authors: We agree that greater statistical rigor strengthens the presentation. The revised manuscript now includes error bars (standard deviation over 5 independent runs) on all accuracy plots, explicit details on proxy label generation (LLM annotations on a 10k-sample training subset per benchmark with temperature=0 for determinism), and clarification that the agreement metric is end-to-end query fidelity measured by precision and recall of the final result set against full-LLM execution. We have added an exclusion-criteria paragraph describing removal of queries with >20% token-length variance and a new stress-test subsection covering 12 query phrasings plus a cross-dataset shift experiment (Amazon reviews to Yelp reviews). These changes allow direct assessment of approximation error relative to the claimed cost savings.
Revision: partial
Referee: The central performance claim (>100x reduction for semantic filters) rests on proxy models reliably approximating LLM semantic judgments; the manuscript supplies no robustness analysis for unseen query types or domains outside the reported benchmarks, leaving the weakest assumption untested.
Authors: The original evaluation already covers three distinct domains (product reviews, Q&A, and news) with the 10M-row Amazon benchmark as the largest-scale test. In revision we have added a dedicated robustness subsection that evaluates the proxy models on 8 held-out query templates per domain and reports accuracy under a controlled distribution shift (training on 2022 reviews, testing on 2023 reviews). We acknowledge that exhaustive coverage of arbitrary unseen domains lies outside the current scope and have therefore added an explicit limitations paragraph plus future-work directions on continual adaptation.
Revision: partial
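The end-to-end fidelity metric discussed in these responses — precision and recall of the proxy-filtered result set against full-LLM execution — can be illustrated on toy row IDs (the IDs and values below are made up for the example):

```python
# Illustrative computation of end-to-end query fidelity: compare the set
# of rows kept by the proxy filter against the set kept by the full LLM.
# Row IDs are invented for this example.
llm_result = {1, 2, 3, 5, 8, 13}   # rows the full-LLM execution keeps
proxy_result = {1, 2, 3, 5, 9}     # rows the proxy model keeps

true_pos = len(llm_result & proxy_result)
precision = true_pos / len(proxy_result)   # 4/5 = 0.80
recall = true_pos / len(llm_result)        # 4/6 ≈ 0.67

print(f"precision={precision:.2f} recall={recall:.2f}")
```

Reporting both numbers matters: a proxy filter can achieve high precision while silently dropping rows the LLM would have kept, which only recall exposes.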
Circularity Check
Empirical measurement study with no derivation circularity
full rationale
The paper is an empirical evaluation of proxy models for approximating LLM-based AI queries in databases. It reports measured >100x cost/latency reductions and accuracy preservation on external benchmarks (e.g., 10M-row Amazon reviews) by direct comparison of proxy outputs to LLM judgments. No equations, fitted parameters, or self-citations reduce the central performance claims to inputs defined by the same data or prior author work; the results are externally falsifiable measurements rather than self-referential derivations.