Access Paths for Efficient Ordering with Large Language Models

Amr El Abbadi; Anupam Datta; Dimitris Tsirogiannis; Divyakant Agrawal; Fuheng Zhao; Jiayue Chen; Paritosh Aggarwal; Sohaib; Tahseen Rabbani; Yiming Pan

arxiv: 2509.00303 · v3 · pith:UPX63FWDnew · submitted 2025-08-30 · 💻 cs.DB · cs.AI · cs.IR

Pith reviewed 2026-05-21 22:28 UTC · model grok-4.3

keywords: semantic operators, LLM ordering, access paths, query optimization, large language models, external merge sort, ranking accuracy, database systems

The pith

A budget-aware optimizer dynamically selects near-optimal access paths for LLM-based ordering that match or exceed the accuracy of the best static methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats LLM ORDER BY as a new logical operator in databases and examines its possible physical implementations. It improves existing semantic sorting methods and adds a semantic-aware external merge sort, then shows through experiments that different implementations win on different datasets. From the observed relationship between computation cost and ranking quality, the authors build an optimizer that uses simple rules plus LLM judgments with consensus to pick a good implementation on the fly for a given budget. If this holds, LLM-powered analytic queries can avoid committing to one fixed sorting strategy and still reach high accuracy at controlled cost.

Core claim

No single physical implementation of semantic ordering is optimal across all datasets; a test-time scaling law links sorting cost to ordering quality for comparison-based methods; therefore a budget-aware optimizer that combines heuristics, LLM-as-Judge scoring, and consensus aggregation can choose access paths whose resulting ranking accuracy is on par with or better than the best static choice on every benchmark tested.
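The cost side of that scaling claim can be made concrete with a toy comparison-based sort. The sketch below is an illustration only, not the paper's method: `CountingComparator` is a hypothetical stand-in for an LLM pairwise judgment, `noise` and `votes` are assumed parameters, and the call counter serves as a rough proxy for monetary cost.

```python
import random
from functools import cmp_to_key


class CountingComparator:
    """Hypothetical stand-in for an LLM pairwise judgment; counts calls as cost."""

    def __init__(self, noise, votes, seed=0):
        self.noise = noise    # probability that a single judgment is flipped
        self.votes = votes    # odd number of judgments per comparison
        self.calls = 0        # total simulated LLM calls so far
        self.rng = random.Random(seed)

    def __call__(self, a, b):
        if a == b:
            return 0
        truth = -1 if a < b else 1
        score = 0
        for _ in range(self.votes):
            self.calls += 1
            flipped = self.rng.random() < self.noise
            score += -truth if flipped else truth
        # With an odd number of votes the majority is never tied.
        return -1 if score < 0 else 1


def semantic_sort(items, comparator):
    """Comparison-based sort driven entirely by the supplied comparator."""
    return sorted(items, key=cmp_to_key(comparator))


if __name__ == "__main__":
    data = list(range(30))
    random.Random(1).shuffle(data)
    for votes in (1, 3, 5):
        cmp = CountingComparator(noise=0.2, votes=votes)
        semantic_sort(data, cmp)
        print(votes, "votes per comparison ->", cmp.calls, "simulated calls")
```

Spending more judgments per comparison raises the call count directly; whether quality improves in step depends on the data and the model, which is the relationship the paper reports measuring.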

What carries the argument

The budget-aware optimizer for the LLM ORDER BY operator, which applies heuristic rules, LLM-as-Judge evaluation, and consensus aggregation to pick a near-optimal physical access path at runtime.
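A minimal sketch of how such a selection step could be wired together, under stated assumptions: the `AccessPath` record, its flat `cost_per_item` model, the judge callables, and the use of a Borda count for consensus are all hypothetical choices made for illustration. This summary does not specify the paper's actual rules, prompts, or aggregation scheme.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class AccessPath:
    name: str
    cost_per_item: float                      # assumed flat dollar cost per row
    run: Callable[[Sequence[str]], list]      # produces an ordering of the rows


def borda(rankings):
    """Aggregate several rankings of the same candidates by Borda count."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, name in enumerate(ranking):
            scores[name] = scores.get(name, 0) + (n - 1 - pos)
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda name: (-scores[name], name))


def choose_path(paths, n_rows, budget, sample, judges):
    """Pick an access path: budget rule first, then judged trial runs."""
    # Heuristic rule: drop paths whose estimated full cost exceeds the budget.
    affordable = [p for p in paths if p.cost_per_item * n_rows <= budget]
    if not affordable:
        raise ValueError("no access path fits the budget")
    # Trial run of each remaining path on a small sample of rows.
    outputs = {p.name: p.run(sample) for p in affordable}
    # Each judge scores every output and ranks the paths from best to worst.
    rankings = [
        sorted(outputs, key=lambda name: (-judge(outputs[name]), name))
        for judge in judges
    ]
    winner = borda(rankings)[0]
    return next(p for p in affordable if p.name == winner)
```

The sketch ignores the cost of the trial runs themselves; a real optimizer would have to charge them against the same budget.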

If this is right

Runtime selection removes the need to commit to one sorting algorithm before seeing the data or the budget.
Semantic ordering becomes practical for large tables because the merge-sort variant and the optimizer together control both cost and quality.
LLM-powered database systems can treat ordering as an optimizable operator rather than a fixed black-box step.
The same selection logic could be reused when other semantic operators are added to analytic pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar optimizers might later be applied to other LLM-based operators such as joins or filters.
The observed cost-quality scaling could be used to set budgets automatically in production query planners.
Extending the approach to streaming or incremental data would require only modest changes to the merge-sort component.

Load-bearing premise

That evaluations based on LLM-as-Judge with consensus aggregation produce stable, unbiased estimates of ordering quality that hold for datasets and models beyond those used in the study.

What would settle it

Apply the optimizer to a fresh dataset whose true ordering is known, run each static method to completion, and check whether the dynamically chosen path ever falls more than a small margin below the accuracy of the single best static method.
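The check described above can be sketched in a few lines. Kendall's tau is used here as one common rank-agreement measure; whether it matches the paper's accuracy metric is not stated in this summary, and the `margin` default of 0.02 is an arbitrary placeholder rather than a value from the paper.

```python
from itertools import combinations


def kendall_tau(predicted, truth):
    """Kendall tau-a between two orderings of the same distinct items (n >= 2)."""
    pos = {item: i for i, item in enumerate(truth)}
    ranks = [pos[item] for item in predicted]
    pairs = list(combinations(range(len(ranks)), 2))
    concordant = sum(ranks[i] < ranks[j] for i, j in pairs)
    discordant = len(pairs) - concordant
    return (concordant - discordant) / len(pairs)


def within_margin(dynamic_output, static_outputs, truth, margin=0.02):
    """True if the dynamic choice is no more than `margin` below the best static path."""
    best_static = max(kendall_tau(out, truth) for out in static_outputs.values())
    return kendall_tau(dynamic_output, truth) >= best_static - margin
```

Running `within_margin` across several fresh datasets and counting failures would give a direct, judge-independent test of the headline claim.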

Figures

Figures reproduced from arXiv: 2509.00303 by Amr El Abbadi, Anupam Datta, Dimitris Tsirogiannis, Divyakant Agrawal, Fuheng Zhao, Jiayue Chen, Paritosh Aggarwal, Sohaib, Tahseen Rabbani, Yiming Pan.

**Figure 2.** Accuracy distribution across algorithms for the …

**Figure 3.** Sorting accuracy vs. the monetary budget ($).

**Figure 4.** Llama models' accuracy distribution on the DL20 …

**Figure 5.** Limiting the optimizer budget to $6 and $70.

Original abstract

In this work, we present the LLM ORDER BY semantic operator as a logical abstraction and conduct a systematic study of its physical implementations. First, we propose several improvements to existing semantic sorting algorithms and introduce a semantic-aware external merge sort algorithm. Our extensive evaluation reveals that no single implementation offers universal optimality on all datasets. From our evaluations, we observe a general test-time scaling relationship between sorting cost and the ordering quality for comparison-based algorithms. Building on these insights, we design a budget-aware optimizer that utilizes heuristic rules, LLM-as-Judge evaluation, and consensus aggregation to dynamically select the near-optimal access path for LLM ORDER BY. In our extensive evaluations, our optimizer consistently achieves ranking accuracy on par with or superior to the best static methods across all benchmarks. We believe that this work provides foundational insights into the principled optimization of semantic operators essential for building robust, large-scale LLM-powered analytic systems.
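For readers unfamiliar with the classical algorithm the abstract builds on, here is a minimal sketch of the external merge sort skeleton with pluggable stand-ins for LLM calls. It shows only the textbook two-phase structure; the abstract does not describe what makes the paper's variant semantic-aware, so nothing below should be read as the authors' algorithm. `sort_run` and `compare` are hypothetical hooks, and `run_size` plays the role of a context-window limit.

```python
import heapq
from functools import cmp_to_key


def external_merge_sort(items, run_size, sort_run, compare):
    """Two-phase external merge sort with pluggable ordering hooks.

    sort_run: orders one run that fits in a single call (e.g. a listwise prompt).
    compare:  pairwise comparator returning a negative, zero, or positive number.
    """
    # Phase 1: cut the input into runs small enough for one call and sort each.
    runs = [sort_run(items[i:i + run_size]) for i in range(0, len(items), run_size)]
    # Phase 2: k-way merge of the sorted runs using pairwise comparisons.
    return list(heapq.merge(*runs, key=cmp_to_key(compare)))


if __name__ == "__main__":
    rows = ["pear", "apple", "fig", "kiwi", "plum", "date", "lime"]
    # Plain string ordering stands in for both kinds of LLM judgment.
    ordered = external_merge_sort(
        rows,
        run_size=3,
        sort_run=sorted,
        compare=lambda a, b: (a > b) - (a < b),
    )
    print(ordered)
```

In this arrangement the run size bounds how many rows any single call must order, while the merge phase only ever compares the current heads of the runs.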

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main contribution is a budget-aware optimizer for LLM-based semantic ordering that mixes heuristics, LLM-as-Judge, and consensus to pick access paths dynamically, but the accuracy claims rest on uncalibrated LLM judgments.

The key point for you is that this work gives a concrete optimizer for choosing among semantic sort implementations at query time, and it reports that the dynamic choice matches or beats the best fixed method on their benchmarks. They also add a semantic-aware external merge sort and note a scaling trend between cost and quality for comparison-based approaches. No single static method wins everywhere, which is a useful observation for anyone building LLM-augmented query engines.

Referee Report

2 major / 1 minor

Summary. The paper introduces the LLM ORDER BY semantic operator as a logical abstraction and systematically studies its physical implementations. It proposes improvements to existing semantic sorting algorithms along with a semantic-aware external merge sort, observes a general test-time scaling relationship between sorting cost and ordering quality for comparison-based methods, and designs a budget-aware optimizer that combines heuristic rules, LLM-as-Judge evaluation, and consensus aggregation to select near-optimal access paths. The central empirical claim is that this optimizer consistently achieves ranking accuracy on par with or superior to the best static methods across all benchmarks.

Significance. If the empirical results hold under proper validation, the work offers foundational insights into principled optimization of semantic operators for LLM-powered analytic systems. The systematic study of access paths, the introduction of a semantic-aware external merge sort, and the identification of a scaling relationship between cost and quality are concrete strengths that could inform future database designs integrating large language models.

major comments (2)

[Evaluation section] The headline claim that the optimizer 'consistently achieves ranking accuracy on par with or superior to the best static methods across all benchmarks' (abstract) rests on LLM-as-Judge evaluations with consensus aggregation, yet no correlation to human ground truth, inter-annotator agreement, or objective proxies (such as downstream query precision on labeled data) is reported. This assumption is load-bearing for the superiority result and risks circular reinforcement if the same judge mechanism influences both path selection and accuracy measurement.
[Evaluation section] The abstract reports high-level observations that no single method is universally optimal and that the optimizer matches or beats static baselines, but provides no quantitative error bars, dataset sizes, number of runs, or statistical tests. Without these, the reliability and generalizability of the central empirical claim cannot be assessed.

minor comments (1)

[Abstract] The abstract could specify the concrete benchmarks, models, and dataset scales used in the 'extensive evaluations' to help readers immediately gauge the scope of the reported results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation methodology. We address each major comment below and will incorporate revisions to strengthen the empirical claims.

Referee: [Evaluation section] The headline claim that the optimizer 'consistently achieves ranking accuracy on par with or superior to the best static methods across all benchmarks' (abstract) rests on LLM-as-Judge evaluations with consensus aggregation, yet no correlation to human ground truth, inter-annotator agreement, or objective proxies (such as downstream query precision on labeled data) is reported. This assumption is load-bearing for the superiority result and risks circular reinforcement if the same judge mechanism influences both path selection and accuracy measurement.

Authors: We agree that the shared use of LLM-as-Judge for both optimizer path selection and final accuracy measurement introduces a risk of circular reinforcement. In the revised manuscript, we will add a dedicated validation subsection that reports correlation between LLM-as-Judge scores and human annotations on a sampled subset of queries from each benchmark. We will also report inter-annotator agreement statistics and, where possible, an objective proxy such as precision on a labeled downstream task. This addition will clarify the reliability of the judge mechanism independent of the optimizer. (Revision: yes)

Referee: [Evaluation section] The abstract reports high-level observations that no single method is universally optimal and that the optimizer matches or beats static baselines, but provides no quantitative error bars, dataset sizes, number of runs, or statistical tests. Without these, the reliability and generalizability of the central empirical claim cannot be assessed.

Authors: We acknowledge that the current presentation lacks the quantitative details necessary for assessing reliability. In the revised evaluation section and abstract, we will explicitly report dataset sizes, the number of independent runs per experiment, error bars (standard deviation or confidence intervals), and results of statistical significance tests (e.g., paired t-tests or Wilcoxon tests) comparing the optimizer against the best static baseline on each benchmark. These additions will be placed in the main evaluation tables and text. (Revision: yes)
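The kind of reporting promised in the second response can be sketched with the standard library alone. The example below computes a mean with sample standard deviation and runs an exact paired sign-flip permutation test; this test is swapped in here as a dependency-free alternative to the paired t-test or Wilcoxon test named in the rebuttal, and the function names are hypothetical.

```python
from itertools import product
from statistics import mean, stdev


def summarize(scores):
    """Mean and sample standard deviation over independent runs (needs >= 2 runs)."""
    return mean(scores), stdev(scores)


def paired_sign_flip_p(optimizer_scores, baseline_scores):
    """Exact two-sided paired permutation test on the summed differences.

    Enumerates all 2**n sign assignments, so it is only practical for small n.
    """
    diffs = [a - b for a, b in zip(optimizer_scores, baseline_scores)]
    observed = abs(sum(diffs))
    hits = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point rounding.
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total
```

With only a handful of runs per benchmark the smallest attainable p-value is 2 / 2**n, so the number of runs caps how strong any significance claim can be.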

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper derives its budget-aware optimizer from empirical observations of test-time scaling relationships between sorting cost and ordering quality, obtained through evaluations of multiple semantic sorting implementations including a proposed semantic-aware external merge sort. The optimizer then applies heuristic rules, LLM-as-Judge evaluations, and consensus aggregation to select access paths. Reported ranking accuracy is presented as an outcome of extensive benchmark comparisons against static methods. No equations, self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text that would make any central claim equivalent to its inputs by construction. The derivation remains self-contained via experimental methodology and independent benchmark results rather than tautological reuse of fitted values or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on empirical observations of scaling behavior and on the reliability of LLM-as-Judge plus consensus; no explicit free parameters or invented entities are named in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "we design a budget-aware optimizer that utilizes heuristic rules, LLM-as-Judge evaluation, and consensus aggregation to dynamically select the near-optimal access path"

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: "we observe a general test-time scaling relationship between sorting cost and the ordering quality for comparison-based algorithms"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers
cs.IR 2026-04 unverdicted novelty 7.0

Code-switching creates a fundamental performance bottleneck for multilingual retrievers, causing drops of up to 27% on new benchmarks CSR-L and CS-MTEB, with embedding divergence as the key cause and vocabulary expans...

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · cited by 1 Pith paper · 6 internal anchors

[1]

TruLens: Evals and Tracing for LLMs and Agents

[n.d.]. TruLens: Evals and Tracing for LLMs and Agents. https://www.trulens. org/. Accessed: 2025-11-25

work page 2025
[2]

Paritosh Aggarwal, Bowei Chen, Anupam Datta, Benjamin Han, Boxin Jiang, Nitish Jindal, Zihan Li, Aaron Lin, Pawel Liskowski, Jay Tayade, Dimitris Tsirogiannis, Nathan Wiegand, and Weicheng Zhao. 2025. Cortex AISQL: A Production SQL Engine for Unstructured Data. arXiv:2511.07663 [cs.DB] https://arxiv.org/abs/2511.07663

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Meta AI. 2024. Llama 3.1: Multilingual, Long-Context Large Language Models (8 B, 70 B, 405 B). https://ai.meta.com/blog/meta-llama-3-1/. Accessed: 2025-11-24

work page 2024
[4]

Ashwin Alaparthi, Paul Loh, and Ryan Marcus. 2025. ScaleLLM: A Technique for Scalable LLM-augmented Data Systems. InCompanion of the 2025 International Conference on Management of Data. 11–14

work page 2025
[5]

Amazon Web Services. 2024. Bringing Generative AI to the Data Warehouse with Amazon Bedrock and Amazon Redshift. https: //repost.aws/articles/ARJszlMEepRti6xoM-0fsBmw/bringing-generative- ai-to-the-data-warehouse-with-amazon-bedrock-and-amazon-redshift AWS re:Post article; accessed: 2025-08-17

work page 2024
[6]

Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa-Anke, and Leonardo Neves. 2020. TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. InFindings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, 1644–1650. https: //doi.org/10.18653/v1/2020.findings-emnlp.148

work page doi:10.18653/v1/2020.findings-emnlp.148 2020
[7]

BerriAI. 2025. LiteLLM: Python SDK and proxy server for calling 100+ LLM APIs. GitHub repository. https://github.com/BerriAI/litellm Accessed: 2025-11-28

work page 2025
[8]

Gavin Brown, Mark Bun, Vitaly Feldman, Adam Smith, and Kunal Talwar. 2021. When is memorization of irrelevant training data necessary for high-accuracy learning?. InProceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (STOC ’21). ACM, 123–132. https://doi.org/10.1145/3406325.3451131

work page doi:10.1145/3406325.3451131 2021
[9]

Yu Chen, Ke Yi, Jun Zhang, and Guoliang Li. 2006. Two-Level Sampling for Join Size Estimation. InProceedings of the 2006 ACM SIGMOD International Conference on Management of Data (SIGMOD ’06). ACM, Chicago, Illinois, USA, 759–770. https://doi.org/10.1145/1142473.1142571

work page doi:10.1145/1142473.1142571 2006
[10]

Zhoujun Cheng, Jungo Kasai, and Tao Yu. 2023. Batch Prompting: Efficient Inference with Large Language Model APIs. arXiv:2301.08721 [cs.CL] https: //arxiv.org/abs/2301.08721

work page arXiv 2023
[11]

Voorhees

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2020. Overview of the TREC 2019 deep learning track.CoRR abs/2003.07820 (2020). arXiv:2003.07820 https://arxiv.org/abs/2003.07820

work page arXiv 2020
[12]

Voorhees, and Ian Soboroff

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Ellen M. Voorhees, and Ian Soboroff. 2021. TREC Deep Learning Track: Reusable Test Collections in the Large Data Regime. InProceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’21). ACM, 2369–2375. https://doi.org/10.1145/3404835.3463249

work page doi:10.1145/3404835.3463249 2021
[13]

Andrew Drozdov, Honglei Zhuang, Zhuyun Dai, Zhen Qin, Razieh Rahimi, Xu- anhui Wang, Dana Alon, Mohit Iyyer, Andrew McCallum, Donald Metzler, et al

work page
[14]

arXiv preprint arXiv:2310.14408(2023)

Parade: Passage ranking using demonstrations with large language models. arXiv preprint arXiv:2310.14408(2023)

work page arXiv 2023
[15]

Peter Emerson. 2013. The original Borda count and partial voting.Social Choice and Welfare40, 2 (2013), 353–358. https://doi.org/10.1007/s00355-011-0603-9

work page doi:10.1007/s00355-011-0603-9 2013
[16]

Avrilia Floratou, Fotis Psallidas, Fuheng Zhao, Shaleen Deep, Gunther Hagleither, Wangda Tan, Joyce Cahoon, Rana Alotaibi, Jordan Henkel, Abhik Singla, et al

work page
[17]

Nl2sql is a solved problem... not!. InCIDR

work page
[18]

Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2023. Text-to-sql empowered by large language models: A benchmark evaluation.arXiv preprint arXiv:2308.15363(2023)

work page arXiv 2023
[19]

Gibbons and Yossi Matias

Phillip B. Gibbons and Yossi Matias. 1998. New Sampling-Based Summary Statistics for Improving Approximate Query Answers. InProceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD ’98). ACM, Seattle, Washington, USA, 331–342. https://doi.org/10.1145/276304.276346

work page doi:10.1145/276304.276346 1998
[20]

Parker Glenn, Parag Pravin Dakle, Liang Wang, and Preethi Raghavan. 2024. Blendsql: A scalable dialect for unifying hybrid question answering in relational algebra.arXiv preprint arXiv:2402.17882(2024)

work page arXiv 2024
[21]

Yue Gong, Chuan Lei, Xiao Qin, Kapil Vaidya, Balakrishnan Narayanaswamy, and Tim Kraska. 2025. SQLens: An End-to-End Framework for Error Detection and Correction in Text-to-SQL.arXiv preprint arXiv:2506.04494(2025)

work page arXiv 2025
[22]

Google Cloud. 2025. Introduction to AI and ML in BigQuery. https://cloud. google.com/bigquery/docs/bqml-introduction Accessed: 2025-08-17

work page 2025
[23]

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. 2025. A Survey on LLM-as- a-Judge. arXiv:2411.15594 [cs.CL] https://arxiv.org/abs/2411.15594

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

Zijian He, Reyna Abhyankar, Vikranth Srivatsa, and Yiying Zhang. 2025. Cognify: Supercharging gen-ai workflows with hierarchical autotuning. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2. 932–943

work page 2025
[25]

Steven Heilman. 2022. Noise Stability of Ranked Choice Voting. arXiv:2209.11183 [cs.GT] https://arxiv.org/abs/2209.11183

work page arXiv 2022
[26]

Daomin Ji, Hui Luo, Zhifeng Bao, and Shane Culpepper. 2025. Table integration in data lakes unleashed: pairwise integrability judgment, integrable set discovery, and multi-tuple conflict resolution.The VLDB Journal34, 36 (2025). https: //doi.org/10.1007/s00778-025-00917-9

work page doi:10.1007/s00778-025-00917-9 2025
[27]

Maurice G Kendall. 1938. A new measure of rank correlation.Biometrika30, 1-2 (1938), 81–93

work page 1938
[28]

Heegyu Kim, Taeyang Jeon, Seunghwan Choi, Seungtaek Choi, and Hyunsouk Cho. 2024. FLEX: Expert-level False-Less EXecution Metric for Reliable Text-to- SQL Benchmark. arXiv:2409.19014 [cs.CL] https://arxiv.org/abs/2409.19014

work page arXiv 2024
[29]

Donald E. Knuth. 1997.The Art of Computer Programming(3 ed.). Vol. 1. Addison- Wesley, Reading, MA

work page 1997
[30]

Jiale Lao, Andreas Zimmerer, Olga Ovcharenko, Tianji Cong, Matthew Russo, Gerardo Vitagliano, Michael Cochez, Fatma Özcan, Gautam Gupta, Thibaud Hottelier, H. V. Jagadish, Kris Kissel, Sebastian Schelter, Andreas Kipf, and Im- manuel Trummer. 2025. SemBench: A Benchmark for Semantic Query Processing Engines. arXiv:2511.01716 [cs.DB] https://arxiv.org/abs/...

work page arXiv 2025
[31]

Dawei Li, Zhen Tan, Chengshuai Zhao, Bohan Jiang, Baixiang Huang, Pingchuan Ma, Abdullah Alnaibari, Kai Shu, and Huan Liu. 2024. From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge.arXiv preprint abs/2411.16594 (2024). https://arxiv.org/abs/2411.16594

work page arXiv 2024
[32]

Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al . 2023. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems36 (2023), 42330–42357

work page 2023
[33]

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michi- hiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al

work page
[34]

Holistic evaluation of language models.arXiv preprint arXiv:2211.09110 (2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022
[35]

Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021. Pyserini: A Python Toolkit for Reproducible Infor- mation Retrieval Research with Sparse and Dense Representations. InProceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)

work page 2021
[36]

Parameswaran

Yiming Lin, Mawil Hasan, Rohan Kosalge, Alvin Cheung, and Aditya G. Parameswaran. 2025. TWIX: Automatically Reconstructing Structured Data from Templatized Documents. arXiv:2501.06659 [cs.DB] https://arxiv.org/abs/ 2501.06659

work page arXiv 2025
[37]

Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baile Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, Rana Shahout, et al. 2025. Palimpzest: Optimizing ai-powered analytics with declarative query processing. InProceedings of the Conference on Innovative Database Research (CIDR). 2

work page 2025
[38]

Lost in the Middle: How Language Models Use Long Contexts

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172 [cs.CL] https://arxiv.org/abs/2307.03172

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Jian Luo, Xuanang Chen, Ben He, and Le Sun. 2024. Prp-graph: Pairwise rank- ing prompting to llms with graph aggregation for effective text re-ranking. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 5766–5776

work page 2024
[40]

Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin. 2023. Zero- shot listwise document reranking with a large language model.arXiv preprint arXiv:2305.02156(2023)

work page arXiv 2023
[41]

Conrado Martínez. 2004. Partial Quicksort. InProceedings of the 6th Workshop on Algorithm Engineering and Experiments and the 1st Workshop on Analytic Algorithmics and Combinatorics (ALENEX/ANALCO). SIAM, New Orleans, LA, USA, 1–8

work page 2004
[42]

Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G

John X. Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G. Edward Suh, Alexander M. Rush, Kamalika Chaudhuri, and Saeed Mahloujifar. 2025. How much do language models memorize? arXiv:2505.24832 [cs.CL] https: //arxiv.org/abs/2505.24832

work page arXiv 2025
[43]

Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. 2020. Document Ranking with a Pretrained Sequence-to-Sequence Model. arXiv:2003.06713 [cs.IR] https: //arxiv.org/abs/2003.06713

work page arXiv 2020
[44]

OpenAI. 2024. OpenAI FAQ: How should I set the temperature parameter? https://platform.openai.com/docs/faq/faq. Accessed: 2025-11-30

work page 2024
[45]

OpenAI. 2024. Structured model outputs. OpenAI API Guide. https://platform. openai.com/docs/guides/structured-outputs Structured outputs ensure model responses adhere to a supplied JSON Schema. Accessed: 2025-08-25

work page 2024
[46]

OpenIntro. 2025. NBA Player Heights (2008–09 Season). R package openintro, dataset nba_heights. Available at https://www.openintro.org/data/index.php? data=nba_heights, accessed 2025-08-25

work page 2025
[47]

Christos Chrysovalantis Papadopoulos, Alkis Simitsis, and Torben Bach Pedersen

work page
[48]

InProceedings of the 41st IEEE International Conference on Data Engineering (ICDE)

HAIDES: Adaptive Approximation of Inference Queries over Unstructured Data. InProceedings of the 41st IEEE International Conference on Data Engineering (ICDE). 2394–2407

work page
[49]

Liana Patel, Siddharth Jha, Carlos Guestrin, and Matei Zaharia. 2024. Lotus: Enabling semantic queries with llms over tables of unstructured and structured 13 data.arXiv e-prints(2024), arXiv–2407

work page 2024
[50]

Tanu Prabhu. 2020. Population by Country — 2020. https://www.kaggle.com/ datasets/tanuprabhu/population-by-country-2020. Accessed: 2025-11-24

work page 2020
[51]

Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. 2023. Rankvicuna: Zero-shot listwise document reranking with open-source large language models. arXiv preprint arXiv:2309.15088(2023)

work page arXiv 2023
[52]

Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Le Yan, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, et al. 2023. Large language models are effective text rankers with pairwise ranking prompting.arXiv preprint arXiv:2306.17563(2023)

work page arXiv 2023
[53]

Stephen Robertson, Hugo Zaragoza, et al . 2009. The probabilistic relevance framework: BM25 and beyond.Foundations and Trends®in Information Retrieval 3, 4 (2009), 333–389

work page 2009
[54]

Donald G. Saari. 2023. Selecting a Voting Method: The Case for the Borda Count. Constitutional Political Economy34, 3 (2023), 357–366. https://doi.org/10.1007/ s10602-022-09380-y

work page 2023
[55]

Devendra Singh Sachan, Mike Lewis, Mandar Joshi, Armen Aghajanyan, Wen- tau Yih, Joelle Pineau, and Luke Zettlemoyer. 2022. Improving passage retrieval with zero-shot question generation.arXiv preprint arXiv:2204.07496(2022)

work page arXiv 2022
[56]

Dario Satriani, Enzo Veltri, Donatello Santoro, Sara Rosato, Simone Varriale, and Paolo Papotti. 2025. Logical and Physical Optimizations for SQL Query Execution over Large Language Models.Proceedings of the ACM on Management of Data3, 3 (2025), 1–28

work page 2025
[57]

P Griffiths Selinger, Morton M Astrahan, Donald D Chamberlin, Raymond A Lorie, and Thomas G Price. 1979. Access path selection in a relational database management system. InProceedings of the 1979 ACM SIGMOD international conference on Management of data. 23–34

work page 1979
[58]

Nihar B Shah and Martin J Wainwright. 2018. Simple, robust and optimal ranking from pairwise comparisons.Journal of machine learning research18, 199 (2018), 1–38

work page 2018
[59]

Shreya Shankar, Tristan Chambers, Tarak Shah, Aditya G Parameswaran, and Eugene Wu. 2024. Docetl: Agentic query rewriting and evaluation for complex document processing.arXiv preprint arXiv:2410.12189(2024)

work page arXiv 2024
[60]

Snowflake, Inc. 2025. Snowflake Cortex AISQL (including LLM functions). https://docs.snowflake.com/user-guide/snowflake-cortex/aisql?lang=de/ Pre- view feature documentation; accessed: 2025-08-17

work page 2025
[61]

Ji Sun, Guoliang Li, Peiyao Zhou, Yihui Ma, Jingzhe Xu, and Yuan Li. 2025. AgenticData: An Agentic Data Analytics System for Heterogeneous Data.arXiv preprint arXiv:2508.05002(2025)

work page arXiv 2025
[62]

Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. 2023. Is ChatGPT good at search? investigat- ing large language models as re-ranking agents.arXiv preprint arXiv:2304.09542 (2023)

work page arXiv 2023
[63]

Zhaoze Sun, Qiyan Deng, Chengliang Chai, Kaisen Jin, Xinyu Guo, Han Han, Ye Yuan, Guoren Wang, and Lei Cao. 2025. QUEST: Query Optimization in Unstructured Document Analysis. InProceedings of the VLDB Endowment

work page 2025
[64]

Immanuel Trummer. 2025. Implementing Semantic Join Operators Efficiently. arXiv:2510.08489 [cs.DB] https://arxiv.org/abs/2510.08489

work page arXiv 2025
[65]

Hongtao Wang, Taiyan Zhang, Renchi Yang, and Jianliang Xu. 2025. Cequel: Cost-Effective Querying of Large Language Models for Text Clustering. InPro- ceedings of the 34th ACM International Conference on Information and Knowl- edge Management (CIKM). Association for Computing Machinery, 2998–3008. https://doi.org/10.1145/3746252.3761074

work page doi:10.1145/3746252.3761074 2025
[66]

Jiayi Wang, Yuan Li, Jianming Wu, Shihui Xu, and Guoliang Li. 2025. Unify: A System For Unstructured Data Analytics.Proceedings of the VLDB Endowment 18, 12 (2025), 5287–5290. https://doi.org/10.14778/3750601.3750653

work page doi:10.14778/3750601.3750653 2025
[67]

Xinyi Wang, Antonis Antoniades, Yanai Elazar, Alfonso Amayuelas, Alon Al- balak, Kexun Zhang, and William Yang Wang. 2025. Generalization v.s. Mem- orization: Tracing Language Models’ Capabilities Back to Pretraining Data. arXiv:2407.14985 [cs.CL] https://arxiv.org/abs/2407.14985

work page arXiv 2025
[68]

Yining Wang, Liwei Wang, Yuanzhi Li, Di He, and Tie-Yan Liu. 2013. A theoretical analysis of NDCG type ranking measures. InConference on learning theory. PMLR, 25–54

work page 2013
[69]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reason- ing in large language models.Advances in neural information processing systems 35 (2022), 24824–24837

[70]

Rui Wen, Zheng Li, Michael Backes, and Yang Zhang. 2024. Membership Inference Attacks Against In-Context Learning. In Proceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS ’24). ACM, Salt Lake City, UT, USA, 3481–3495. https://doi.org/10.1145/3658644.3690306

[71]

Wentao Wu, Jeffrey F. Naughton, and Harneet Singh. 2016. Sampling-Based Query Re-Optimization. arXiv:1601.05748 [cs.DB] https://arxiv.org/abs/1601.05748

[72]

Jialin Yang, Dongfu Jiang, Lipeng He, Sherman Siu, Yuxuan Zhang, Disen Liao, Zhuofeng Li, Huaye Zeng, Yiming Jia, Haozhe Wang, Benjamin Schneider, Chi Ruan, Wentao Ma, Zhiheng Lyu, Yifei Wang, Yi Lu, Quy Duc Do, Ziyan Jiang, Ping Nie, and Wenhu Chen. 2025. StructEval: Benchmarking LLMs’ Capabilities to Generate Structural Outputs. arXiv:2505.20139 [cs.SE]...

[73]

Victor Zakhary, Lawrence Lim, Divyakant Agrawal, and Amr El Abbadi. 2020. CoT: Decentralized elastic caches for cloud environments. arXiv preprint arXiv:2006.08067 (2020)

[74]

Sirui Zeng and Xifeng Yan. 2025. ADL: A Declarative Language for Agent-Based Chatbots. arXiv preprint arXiv:2504.14787 (2025)

[75]

Fuheng Zhao, Divyakant Agrawal, and Amr El Abbadi. 2024. Hybrid querying over relational databases and large language models. arXiv preprint arXiv:2408.00884 (2024)

[76]

Fuheng Zhao, Jiayue Chen, Lawrence Lim, Ishtiyaque Ahmad, Divyakant Agrawal, and Amr El Abbadi. 2023. Llm-sql-solver: Can llms determine SQL equivalence? arXiv preprint arXiv:2312.10321 (2023)

[77]

Fuheng Zhao, Shaleen Deep, Fotis Psallidas, Avrilia Floratou, Divyakant Agrawal, and Amr El Abbadi. 2024. Sphinteract: Resolving Ambiguities in NL2SQL through User Interaction. Proceedings of the VLDB Endowment 18, 4 (2024), 1145–1158

[78]

Zhanhao Zhao, Shaofeng Cai, Haotian Gao, Hexiang Pan, Siqi Xiang, Naili Xing, Gang Chen, Beng Chin Ooi, Yanyan Shen, Yuncheng Wu, and Meihui Zhang. 2025. NeurDB: On the Design and Implementation of an AI-powered Autonomous Database. arXiv:2408.03013 [cs.DB] https://arxiv.org/abs/2408.03013

[79]

Lixi Zhou, Qi Lin, Kanchan Chowdhury, Saif Masood, Alexandre Eichenberger, Hong Min, Alexander Sim, Jie Wang, Yida Wang, Kesheng Wu, et al. 2024. Serving Deep Learning Models from Relational Databases. Advances in Database Technology - EDBT 27, 3 (2024), 717–724

[80]

Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Haonan Chen, Zheng Liu, Zhicheng Dou, and Ji-Rong Wen. 2023. Large language models for information retrieval: A survey. arXiv preprint arXiv:2308.07107 (2023)


[1]

[n.d.]. TruLens: Evals and Tracing for LLMs and Agents. https://www.trulens.org/. Accessed: 2025-11-25

[2]

Paritosh Aggarwal, Bowei Chen, Anupam Datta, Benjamin Han, Boxin Jiang, Nitish Jindal, Zihan Li, Aaron Lin, Pawel Liskowski, Jay Tayade, Dimitris Tsirogiannis, Nathan Wiegand, and Weicheng Zhao. 2025. Cortex AISQL: A Production SQL Engine for Unstructured Data. arXiv:2511.07663 [cs.DB] https://arxiv.org/abs/2511.07663

[3]

Meta AI. 2024. Llama 3.1: Multilingual, Long-Context Large Language Models (8 B, 70 B, 405 B). https://ai.meta.com/blog/meta-llama-3-1/. Accessed: 2025-11-24

[4]

Ashwin Alaparthi, Paul Loh, and Ryan Marcus. 2025. ScaleLLM: A Technique for Scalable LLM-augmented Data Systems. In Companion of the 2025 International Conference on Management of Data. 11–14

[5]

Amazon Web Services. 2024. Bringing Generative AI to the Data Warehouse with Amazon Bedrock and Amazon Redshift. https://repost.aws/articles/ARJszlMEepRti6xoM-0fsBmw/bringing-generative-ai-to-the-data-warehouse-with-amazon-bedrock-and-amazon-redshift AWS re:Post article; accessed: 2025-08-17

[6]

Francesco Barbieri, Jose Camacho-Collados, Luis Espinosa-Anke, and Leonardo Neves. 2020. TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification. In Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, 1644–1650. https://doi.org/10.18653/v1/2020.findings-emnlp.148

[7]

BerriAI. 2025. LiteLLM: Python SDK and proxy server for calling 100+ LLM APIs. GitHub repository. https://github.com/BerriAI/litellm Accessed: 2025-11-28

[8]

Gavin Brown, Mark Bun, Vitaly Feldman, Adam Smith, and Kunal Talwar. 2021. When is memorization of irrelevant training data necessary for high-accuracy learning? In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing (STOC ’21). ACM, 123–132. https://doi.org/10.1145/3406325.3451131

[9]

Yu Chen, Ke Yi, Jun Zhang, and Guoliang Li. 2006. Two-Level Sampling for Join Size Estimation. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (SIGMOD ’06). ACM, Chicago, Illinois, USA, 759–770. https://doi.org/10.1145/1142473.1142571

[10]

Zhoujun Cheng, Jungo Kasai, and Tao Yu. 2023. Batch Prompting: Efficient Inference with Large Language Model APIs. arXiv:2301.08721 [cs.CL] https://arxiv.org/abs/2301.08721

[11]

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, and Ellen M. Voorhees. 2020. Overview of the TREC 2019 deep learning track. CoRR abs/2003.07820 (2020). arXiv:2003.07820 https://arxiv.org/abs/2003.07820

[12]

Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Ellen M. Voorhees, and Ian Soboroff. 2021. TREC Deep Learning Track: Reusable Test Collections in the Large Data Regime. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’21). ACM, 2369–2375. https://doi.org/10.1145/3404835.3463249

[13]

Andrew Drozdov, Honglei Zhuang, Zhuyun Dai, Zhen Qin, Razieh Rahimi, Xuanhui Wang, Dana Alon, Mohit Iyyer, Andrew McCallum, Donald Metzler, et al. 2023. Parade: Passage ranking using demonstrations with large language models. arXiv preprint arXiv:2310.14408 (2023)

[15]

Peter Emerson. 2013. The original Borda count and partial voting. Social Choice and Welfare 40, 2 (2013), 353–358. https://doi.org/10.1007/s00355-011-0603-9

[16]

Avrilia Floratou, Fotis Psallidas, Fuheng Zhao, Shaleen Deep, Gunther Hagleither, Wangda Tan, Joyce Cahoon, Rana Alotaibi, Jordan Henkel, Abhik Singla, et al. 2024. Nl2sql is a solved problem... not!. In CIDR

[18]

Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2023. Text-to-sql empowered by large language models: A benchmark evaluation. arXiv preprint arXiv:2308.15363 (2023)

[19]

Phillip B. Gibbons and Yossi Matias. 1998. New Sampling-Based Summary Statistics for Improving Approximate Query Answers. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (SIGMOD ’98). ACM, Seattle, Washington, USA, 331–342. https://doi.org/10.1145/276304.276346

[20]

Parker Glenn, Parag Pravin Dakle, Liang Wang, and Preethi Raghavan. 2024. Blendsql: A scalable dialect for unifying hybrid question answering in relational algebra. arXiv preprint arXiv:2402.17882 (2024)

[21]

Yue Gong, Chuan Lei, Xiao Qin, Kapil Vaidya, Balakrishnan Narayanaswamy, and Tim Kraska. 2025. SQLens: An End-to-End Framework for Error Detection and Correction in Text-to-SQL. arXiv preprint arXiv:2506.04494 (2025)

[22]

Google Cloud. 2025. Introduction to AI and ML in BigQuery. https://cloud.google.com/bigquery/docs/bqml-introduction Accessed: 2025-08-17

[23]

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Saizhuo Wang, Kun Zhang, Yuanzhuo Wang, Wen Gao, Lionel Ni, and Jian Guo. 2025. A Survey on LLM-as-a-Judge. arXiv:2411.15594 [cs.CL] https://arxiv.org/abs/2411.15594

[24]

Zijian He, Reyna Abhyankar, Vikranth Srivatsa, and Yiying Zhang. 2025. Cognify: Supercharging gen-ai workflows with hierarchical autotuning. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2. 932–943

[25]

Steven Heilman. 2022. Noise Stability of Ranked Choice Voting. arXiv:2209.11183 [cs.GT] https://arxiv.org/abs/2209.11183

[26]

Daomin Ji, Hui Luo, Zhifeng Bao, and Shane Culpepper. 2025. Table integration in data lakes unleashed: pairwise integrability judgment, integrable set discovery, and multi-tuple conflict resolution. The VLDB Journal 34, 36 (2025). https://doi.org/10.1007/s00778-025-00917-9

[27]

Maurice G Kendall. 1938. A new measure of rank correlation. Biometrika 30, 1-2 (1938), 81–93

[28]

Heegyu Kim, Taeyang Jeon, Seunghwan Choi, Seungtaek Choi, and Hyunsouk Cho. 2024. FLEX: Expert-level False-Less EXecution Metric for Reliable Text-to-SQL Benchmark. arXiv:2409.19014 [cs.CL] https://arxiv.org/abs/2409.19014

[29]

Donald E. Knuth. 1997. The Art of Computer Programming (3 ed.). Vol. 1. Addison-Wesley, Reading, MA

[30]

Jiale Lao, Andreas Zimmerer, Olga Ovcharenko, Tianji Cong, Matthew Russo, Gerardo Vitagliano, Michael Cochez, Fatma Özcan, Gautam Gupta, Thibaud Hottelier, H. V. Jagadish, Kris Kissel, Sebastian Schelter, Andreas Kipf, and Immanuel Trummer. 2025. SemBench: A Benchmark for Semantic Query Processing Engines. arXiv:2511.01716 [cs.DB] https://arxiv.org/abs/...

[31]

Dawei Li, Zhen Tan, Chengshuai Zhao, Bohan Jiang, Baixiang Huang, Pingchuan Ma, Abdullah Alnaibari, Kai Shu, and Huan Liu. 2024. From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge. arXiv preprint abs/2411.16594 (2024). https://arxiv.org/abs/2411.16594

[32]

Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. 2023. Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls. Advances in Neural Information Processing Systems 36 (2023), 42330–42357

[33]

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110 (2022)

[35]

Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021. Pyserini: A Python Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR)

[36]

Yiming Lin, Mawil Hasan, Rohan Kosalge, Alvin Cheung, and Aditya G. Parameswaran. 2025. TWIX: Automatically Reconstructing Structured Data from Templatized Documents. arXiv:2501.06659 [cs.DB] https://arxiv.org/abs/2501.06659

[37]

Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baile Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, Rana Shahout, et al. 2025. Palimpzest: Optimizing ai-powered analytics with declarative query processing. In Proceedings of the Conference on Innovative Database Research (CIDR). 2

[38]

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172 [cs.CL] https://arxiv.org/abs/2307.03172

[39]

Jian Luo, Xuanang Chen, Ben He, and Le Sun. 2024. Prp-graph: Pairwise ranking prompting to llms with graph aggregation for effective text re-ranking. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 5766–5776

[40]

Xueguang Ma, Xinyu Zhang, Ronak Pradeep, and Jimmy Lin. 2023. Zero-shot listwise document reranking with a large language model. arXiv preprint arXiv:2305.02156 (2023)

[41]

Conrado Martínez. 2004. Partial Quicksort. In Proceedings of the 6th Workshop on Algorithm Engineering and Experiments and the 1st Workshop on Analytic Algorithmics and Combinatorics (ALENEX/ANALCO). SIAM, New Orleans, LA, USA, 1–8

[42]

John X. Morris, Chawin Sitawarin, Chuan Guo, Narine Kokhlikyan, G. Edward Suh, Alexander M. Rush, Kamalika Chaudhuri, and Saeed Mahloujifar. 2025. How much do language models memorize? arXiv:2505.24832 [cs.CL] https://arxiv.org/abs/2505.24832

[43]

Rodrigo Nogueira, Zhiying Jiang, and Jimmy Lin. 2020. Document Ranking with a Pretrained Sequence-to-Sequence Model. arXiv:2003.06713 [cs.IR] https://arxiv.org/abs/2003.06713

[44]

OpenAI. 2024. OpenAI FAQ: How should I set the temperature parameter? https://platform.openai.com/docs/faq/faq. Accessed: 2025-11-30

[45]

OpenAI. 2024. Structured model outputs. OpenAI API Guide. https://platform.openai.com/docs/guides/structured-outputs Structured outputs ensure model responses adhere to a supplied JSON Schema. Accessed: 2025-08-25

[46]

OpenIntro. 2025. NBA Player Heights (2008–09 Season). R package openintro, dataset nba_heights. Available at https://www.openintro.org/data/index.php?data=nba_heights, accessed 2025-08-25

[47]

Christos Chrysovalantis Papadopoulos, Alkis Simitsis, and Torben Bach Pedersen. 2025. HAIDES: Adaptive Approximation of Inference Queries over Unstructured Data. In Proceedings of the 41st IEEE International Conference on Data Engineering (ICDE). 2394–2407

[49]

Liana Patel, Siddharth Jha, Carlos Guestrin, and Matei Zaharia. 2024. Lotus: Enabling semantic queries with llms over tables of unstructured and structured data. arXiv e-prints (2024), arXiv–2407

[50]

Tanu Prabhu. 2020. Population by Country — 2020. https://www.kaggle.com/datasets/tanuprabhu/population-by-country-2020. Accessed: 2025-11-24

[51]

Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. 2023. Rankvicuna: Zero-shot listwise document reranking with open-source large language models. arXiv preprint arXiv:2309.15088 (2023)

[52]

Zhen Qin, Rolf Jagerman, Kai Hui, Honglei Zhuang, Junru Wu, Le Yan, Jiaming Shen, Tianqi Liu, Jialu Liu, Donald Metzler, et al. 2023. Large language models are effective text rankers with pairwise ranking prompting. arXiv preprint arXiv:2306.17563 (2023)

[53]

Stephen Robertson, Hugo Zaragoza, et al. 2009. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3, 4 (2009), 333–389

[54]

Donald G. Saari. 2023. Selecting a Voting Method: The Case for the Borda Count. Constitutional Political Economy 34, 3 (2023), 357–366. https://doi.org/10.1007/s10602-022-09380-y

[55]

Devendra Singh Sachan, Mike Lewis, Mandar Joshi, Armen Aghajanyan, Wen-tau Yih, Joelle Pineau, and Luke Zettlemoyer. 2022. Improving passage retrieval with zero-shot question generation. arXiv preprint arXiv:2204.07496 (2022)

[56]

Dario Satriani, Enzo Veltri, Donatello Santoro, Sara Rosato, Simone Varriale, and Paolo Papotti. 2025. Logical and Physical Optimizations for SQL Query Execution over Large Language Models. Proceedings of the ACM on Management of Data 3, 3 (2025), 1–28

[57]

P Griffiths Selinger, Morton M Astrahan, Donald D Chamberlin, Raymond A Lorie, and Thomas G Price. 1979. Access path selection in a relational database management system. In Proceedings of the 1979 ACM SIGMOD international conference on Management of data. 23–34

[58]

Nihar B Shah and Martin J Wainwright. 2018. Simple, robust and optimal ranking from pairwise comparisons. Journal of machine learning research 18, 199 (2018), 1–38

[59]

Shreya Shankar, Tristan Chambers, Tarak Shah, Aditya G Parameswaran, and Eugene Wu. 2024. Docetl: Agentic query rewriting and evaluation for complex document processing. arXiv preprint arXiv:2410.12189 (2024)

[60]

Snowflake, Inc. 2025. Snowflake Cortex AISQL (including LLM functions). https://docs.snowflake.com/user-guide/snowflake-cortex/aisql?lang=de/ Preview feature documentation; accessed: 2025-08-17

[61]

Ji Sun, Guoliang Li, Peiyao Zhou, Yihui Ma, Jingzhe Xu, and Yuan Li. 2025. AgenticData: An Agentic Data Analytics System for Heterogeneous Data. arXiv preprint arXiv:2508.05002 (2025)
