
arXiv: 2509.00303 · v3 · pith:UPX63FWDnew · submitted 2025-08-30 · cs.DB · cs.AI · cs.IR

Access Paths for Efficient Ordering with Large Language Models

Pith reviewed 2026-05-21 22:28 UTC · model grok-4.3

classification cs.DB · cs.AI · cs.IR
keywords semantic operators · LLM ordering · access paths · query optimization · large language models · external merge sort · ranking accuracy · database systems

The pith

A budget-aware optimizer dynamically selects near-optimal access paths for LLM-based ordering that match or exceed the accuracy of the best static methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper treats LLM ORDER BY as a new logical operator in databases and examines its possible physical implementations. It improves existing semantic sorting methods and adds a semantic-aware external merge sort, then shows through experiments that different implementations win on different datasets. From the observed relationship between computation cost and ranking quality, the authors build an optimizer that uses simple rules plus LLM judgments with consensus to pick a good implementation on the fly for a given budget. If this holds, LLM-powered analytic queries can avoid committing to one fixed sorting strategy and still reach high accuracy at controlled cost.
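To make the idea of a comparison-based access path concrete, here is a minimal sketch of an external-merge-style sort in which every comparison goes through a pluggable judgment function. It is not the paper's semantic-aware algorithm: the run size, the `counting` wrapper, and the `toy_before` comparator (which stands in for an LLM pairwise call) are all assumptions made for illustration. The call counter is only a rough proxy for monetary cost.

```python
from typing import Callable, Dict, List, Tuple

Before = Callable[[str, str], bool]  # True if the first item should rank before the second


def counting(before: Before) -> Tuple[Before, Dict[str, int]]:
    """Wrap a comparator so that every call is counted (a proxy for LLM spend)."""
    stats = {"calls": 0}

    def wrapped(a: str, b: str) -> bool:
        stats["calls"] += 1
        return before(a, b)

    return wrapped, stats


def sort_run(items: List[str], before: Before) -> List[str]:
    """Sort one small run with insertion sort, using only pairwise judgments."""
    run: List[str] = []
    for item in items:
        i = len(run)
        while i > 0 and before(item, run[i - 1]):
            i -= 1
        run.insert(i, item)
    return run


def merge(left: List[str], right: List[str], before: Before) -> List[str]:
    """Merge two sorted runs; ties keep the element from the left run first."""
    out: List[str] = []
    i = j = 0
    while i < len(left) and j < len(right):
        if before(right[j], left[i]):
            out.append(right[j])
            j += 1
        else:
            out.append(left[i])
            i += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def external_merge_sort(items: List[str], before: Before, run_size: int = 4) -> List[str]:
    """Split into runs, sort each run, then merge runs pairwise until one remains."""
    runs = [sort_run(items[k:k + run_size], before) for k in range(0, len(items), run_size)]
    while len(runs) > 1:
        merged = []
        for k in range(0, len(runs), 2):
            if k + 1 < len(runs):
                merged.append(merge(runs[k], runs[k + 1], before))
            else:
                merged.append(runs[k])
        runs = merged
    return runs[0] if runs else []


def toy_before(a: str, b: str) -> bool:
    """Hypothetical stand-in for an LLM pairwise judgment: shorter strings rank first."""
    return len(a) < len(b)


before, stats = counting(toy_before)
docs = ["ccc", "a", "eeeee", "bb", "dddd", "ffffff"]
ranked = external_merge_sort(docs, before, run_size=2)
print(ranked, stats["calls"])
```

In a real system the comparator would be replaced by a model call, and the run size would trade off the number of calls per run against the number of merge passes.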

Core claim

No single physical implementation of semantic ordering is optimal across all datasets; a test-time scaling law links sorting cost to ordering quality for comparison-based methods; therefore a budget-aware optimizer that combines heuristics, LLM-as-Judge scoring, and consensus aggregation can choose access paths whose resulting ranking accuracy is on par with or better than the best static choice on every benchmark tested.

What carries the argument

The budget-aware optimizer for the LLM ORDER BY operator, which applies heuristic rules, LLM-as-Judge evaluation, and consensus aggregation to pick a near-optimal physical access path at runtime.
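One way to picture this selection step is the sketch below. It assumes each candidate path has an estimated cost, that several judges have each ranked the candidates on a small sample, and that a Borda count is used as the consensus rule. The summary does not say which aggregation rule the paper uses, so the Borda count, the path names, and the cost figures are all hypothetical.

```python
from typing import Dict, List


def borda_consensus(rankings: List[List[str]]) -> List[str]:
    """Aggregate several rankings (best first) with a Borda count."""
    scores: Dict[str, int] = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, name in enumerate(ranking):
            scores[name] = scores.get(name, 0) + (n - 1 - pos)
    # Higher score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda name: (-scores[name], name))


def choose_path(est_cost: Dict[str, float],
                judge_rankings: List[List[str]],
                budget: float) -> str:
    """Drop paths that exceed the budget, then pick the consensus winner."""
    affordable = {path for path, cost in est_cost.items() if cost <= budget}
    if not affordable:
        raise ValueError("no access path fits the budget")
    filtered = [[p for p in ranking if p in affordable] for ranking in judge_rankings]
    return borda_consensus(filtered)[0]


# Hypothetical candidates and costs in dollars (not taken from the paper).
est_cost = {"pointwise": 1.0, "merge_sort": 5.0, "pairwise_all": 40.0}
# Hypothetical per-judge rankings of the candidates on a sample, best first.
judge_rankings = [
    ["pairwise_all", "merge_sort", "pointwise"],
    ["pairwise_all", "merge_sort", "pointwise"],
    ["merge_sort", "pairwise_all", "pointwise"],
]

low_budget_choice = choose_path(est_cost, judge_rankings, budget=6.0)
high_budget_choice = choose_path(est_cost, judge_rankings, budget=70.0)
print(low_budget_choice, high_budget_choice)
```

The heuristic-rule stage is reduced here to a single budget filter; a fuller optimizer would presumably also prune candidates by table size or by the kind of sort key.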

If this is right

  • Runtime selection removes the need to commit to one sorting algorithm before seeing the data or the budget.
  • Semantic ordering becomes practical for large tables because the merge-sort variant and the optimizer together control both cost and quality.
  • LLM-powered database systems can treat ordering as an optimizable operator rather than a fixed black-box step.
  • The same selection logic could be reused when other semantic operators are added to analytic pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar optimizers might later be applied to other LLM-based operators such as joins or filters.
  • The observed cost-quality scaling could be used to set budgets automatically in production query planners.
  • Extending the approach to streaming or incremental data would require only modest changes to the merge-sort component.

Load-bearing premise

That evaluations based on LLM-as-Judge with consensus aggregation produce stable, unbiased estimates of ordering quality that hold for datasets and models beyond those used in the study.
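A simple way to probe this premise is to compare judge verdicts with human verdicts on the same item pairs and report a chance-corrected agreement score. The sketch below computes Cohen's kappa from scratch. The labels are made-up toy values, and encoding each verdict as "A" or "B" (which of two items should rank first) is an assumption for illustration.

```python
from collections import Counter
from typing import List


def cohens_kappa(labels_a: List[str], labels_b: List[str]) -> float:
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators always use the same single label
    return (observed - expected) / (1.0 - expected)


# Toy verdicts on eight item pairs (illustrative only, not data from the paper).
human = ["A", "A", "B", "B", "A", "B", "A", "B"]
judge = ["A", "A", "B", "A", "A", "B", "B", "B"]

kappa = cohens_kappa(human, judge)
print(round(kappa, 3))
```

A kappa near 1 would support treating the judge as a proxy for human labels on that dataset; a low value would suggest the accuracy numbers depend heavily on the judge.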

What would settle it

Apply the optimizer to a fresh dataset whose true ordering is known, run each static method to completion, and check whether the dynamically chosen path ever falls more than a small margin below the accuracy of the single best static method.
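The check described above can be sketched as follows, assuming ranking accuracy is measured with Kendall's tau against the known ordering and that "a small margin" is a fixed tolerance. The metric choice, the margin value, and the method names are assumptions; the summary does not state which accuracy metric the paper uses.

```python
from itertools import combinations
from typing import Dict, List


def kendall_tau(predicted: List[str], truth: List[str]) -> float:
    """Kendall tau-a between two orderings of the same distinct items."""
    pos_pred = {item: i for i, item in enumerate(predicted)}
    pos_true = {item: i for i, item in enumerate(truth)}
    concordant = discordant = 0
    for a, b in combinations(truth, 2):
        same_direction = (pos_pred[a] - pos_pred[b]) * (pos_true[a] - pos_true[b]) > 0
        if same_direction:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / pairs if pairs else 1.0


def within_margin(chosen: str,
                  outputs: Dict[str, List[str]],
                  truth: List[str],
                  margin: float = 0.05) -> bool:
    """True if the chosen path is no more than `margin` below the best static path."""
    scores = {name: kendall_tau(order, truth) for name, order in outputs.items()}
    return scores[chosen] >= max(scores.values()) - margin


# Toy ground truth and per-method outputs (hypothetical names and orderings).
truth = ["a", "b", "c", "d"]
outputs = {
    "pointwise": ["b", "a", "d", "c"],
    "merge_sort": ["a", "b", "d", "c"],
}

print(within_margin("merge_sort", outputs, truth))
print(within_margin("pointwise", outputs, truth))
```

Running this across many fresh datasets and budgets, and counting how often the check fails, would give a direct estimate of how often the optimizer falls short of the best static choice.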

Figures

Figures reproduced from arXiv: 2509.00303 by Amr El Abbadi, Anupam Datta, Dimitris Tsirogiannis, Divyakant Agrawal, Fuheng Zhao, Jiayue Chen, Paritosh Aggarwal, Sohaib, Tahseen Rabbani, Yiming Pan.

Figure 1: Sorting accuracy vs. the monetary budget ($). While more budget generally leads to better accuracy, these figures …
Figure 2: Accuracy distribution across algorithms for the …
Figure 3: Sorting accuracy vs. the monetary budget ($).
Figure 4: Llama models’ accuracy distribution on the DL20 …
Figure 5: Limiting the optimizer budget to $6 and $70.

Original abstract

In this work, we present the LLM ORDER BY semantic operator as a logical abstraction and conduct a systematic study of its physical implementations. First, we propose several improvements to existing semantic sorting algorithms and introduce a semantic-aware external merge sort algorithm. Our extensive evaluation reveals that no single implementation offers universal optimality on all datasets. From our evaluations, we observe a general test-time scaling relationship between sorting cost and the ordering quality for comparison-based algorithms. Building on these insights, we design a budget-aware optimizer that utilizes heuristic rules, LLM-as-Judge evaluation, and consensus aggregation to dynamically select the near-optimal access path for LLM ORDER BY. In our extensive evaluations, our optimizer consistently achieves ranking accuracy on par with or superior to the best static methods across all benchmarks. We believe that this work provides foundational insights into the principled optimization of semantic operators essential for building robust, large-scale LLM-powered analytic systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the LLM ORDER BY semantic operator as a logical abstraction and systematically studies its physical implementations. It proposes improvements to existing semantic sorting algorithms along with a semantic-aware external merge sort, observes a general test-time scaling relationship between sorting cost and ordering quality for comparison-based methods, and designs a budget-aware optimizer that combines heuristic rules, LLM-as-Judge evaluation, and consensus aggregation to select near-optimal access paths. The central empirical claim is that this optimizer consistently achieves ranking accuracy on par with or superior to the best static methods across all benchmarks.

Significance. If the empirical results hold under proper validation, the work offers foundational insights into principled optimization of semantic operators for LLM-powered analytic systems. The systematic study of access paths, the introduction of a semantic-aware external merge sort, and the identification of a scaling relationship between cost and quality are concrete strengths that could inform future database designs integrating large language models.

major comments (2)
  1. [Evaluation section] The headline claim that the optimizer 'consistently achieves ranking accuracy on par with or superior to the best static methods across all benchmarks' (abstract) rests on LLM-as-Judge evaluations with consensus aggregation, yet no correlation to human ground truth, inter-annotator agreement, or objective proxies (such as downstream query precision on labeled data) is reported. This assumption is load-bearing for the superiority result and risks circular reinforcement if the same judge mechanism influences both path selection and accuracy measurement.
  2. [Evaluation section] The abstract reports high-level observations that no single method is universally optimal and that the optimizer matches or beats static baselines, but provides no quantitative error bars, dataset sizes, number of runs, or statistical tests. Without these, the reliability and generalizability of the central empirical claim cannot be assessed.
minor comments (1)
  1. [Abstract] The abstract could specify the concrete benchmarks, models, and dataset scales used in the 'extensive evaluations' to help readers immediately gauge the scope of the reported results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our evaluation methodology. We address each major comment below and will incorporate revisions to strengthen the empirical claims.

  1. Referee: [Evaluation section] The headline claim that the optimizer 'consistently achieves ranking accuracy on par with or superior to the best static methods across all benchmarks' (abstract) rests on LLM-as-Judge evaluations with consensus aggregation, yet no correlation to human ground truth, inter-annotator agreement, or objective proxies (such as downstream query precision on labeled data) is reported. This assumption is load-bearing for the superiority result and risks circular reinforcement if the same judge mechanism influences both path selection and accuracy measurement.

    Authors: We agree that the shared use of LLM-as-Judge for both optimizer path selection and final accuracy measurement introduces a risk of circular reinforcement. In the revised manuscript, we will add a dedicated validation subsection that reports correlation between LLM-as-Judge scores and human annotations on a sampled subset of queries from each benchmark. We will also report inter-annotator agreement statistics and, where possible, an objective proxy such as precision on a labeled downstream task. This addition will clarify the reliability of the judge mechanism independent of the optimizer. revision: yes

  2. Referee: [Evaluation section] The abstract reports high-level observations that no single method is universally optimal and that the optimizer matches or beats static baselines, but provides no quantitative error bars, dataset sizes, number of runs, or statistical tests. Without these, the reliability and generalizability of the central empirical claim cannot be assessed.

    Authors: We acknowledge that the current presentation lacks the quantitative details necessary for assessing reliability. In the revised evaluation section and abstract, we will explicitly report dataset sizes, the number of independent runs per experiment, error bars (standard deviation or confidence intervals), and results of statistical significance tests (e.g., paired t-tests or Wilcoxon tests) comparing the optimizer against the best static baseline on each benchmark. These additions will be placed in the main evaluation tables and text. revision: yes
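The kind of paired comparison promised in this response can be sketched with the standard library alone. The example below reports the mean and sample standard deviation of per-run accuracy differences, plus a two-sided exact sign-flip permutation p-value. The sign-flip test is swapped in here as a dependency-free alternative to the paired t-test or Wilcoxon test named in the rebuttal, and the accuracy values are made-up toy numbers, not results from the paper.

```python
from itertools import product
from statistics import mean, stdev
from typing import List, Tuple


def paired_summary(optimizer: List[float], baseline: List[float]) -> Tuple[float, float, float]:
    """Return (mean difference, sample std of differences, exact sign-flip p-value)."""
    if len(optimizer) != len(baseline) or len(optimizer) < 2:
        raise ValueError("need two equal-length lists with at least two runs")
    diffs = [o - b for o, b in zip(optimizer, baseline)]
    observed = abs(sum(diffs))
    extreme = total = 0
    # Under the null hypothesis each difference is equally likely to have either sign,
    # so enumerate all 2^n sign assignments (feasible only for a small number of runs).
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed - 1e-12:
            extreme += 1
    return mean(diffs), stdev(diffs), extreme / total


# Toy per-run accuracies for five runs (illustrative only).
optimizer_acc = [0.82, 0.79, 0.85, 0.81, 0.80]
baseline_acc = [0.78, 0.77, 0.80, 0.79, 0.76]

mean_diff, std_diff, p_value = paired_summary(optimizer_acc, baseline_acc)
print(round(mean_diff, 3), round(std_diff, 3), p_value)
```

With only five runs the smallest attainable two-sided p-value is 2/32, which illustrates why the number of independent runs matters as much as the choice of test.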

Circularity Check

0 steps flagged

No significant circularity in the derivation chain


The paper derives its budget-aware optimizer from empirical observations of test-time scaling relationships between sorting cost and ordering quality, obtained through evaluations of multiple semantic sorting implementations including a proposed semantic-aware external merge sort. The optimizer then applies heuristic rules, LLM-as-Judge evaluations, and consensus aggregation to select access paths. Reported ranking accuracy is presented as an outcome of extensive benchmark comparisons against static methods. No equations, self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text that would make any central claim equivalent to its inputs by construction. The derivation remains self-contained via experimental methodology and independent benchmark results rather than tautological reuse of fitted values or prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on empirical observations of scaling behavior and on the reliability of LLM-as-Judge plus consensus; no explicit free parameters or invented entities are named in the abstract.

pith-pipeline@v0.9.0 · 5726 in / 1033 out tokens · 35957 ms · 2026-05-21T22:28:58.651783+00:00 · methodology



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Code-Switching Information Retrieval: Benchmarks, Analysis, and the Limits of Current Retrievers

    cs.IR 2026-04 unverdicted novelty 7.0

    Code-switching creates a fundamental performance bottleneck for multilingual retrievers, causing drops of up to 27% on new benchmarks CSR-L and CS-MTEB, with embedding divergence as the key cause and vocabulary expans...
