arxiv: 2603.09181 · v2 · submitted 2026-03-10 · 💻 cs.DB

Evaluating the Practical Effectiveness of LLM-Driven Index Tuning with Microsoft Database Tuning Advisor

Pith reviewed 2026-05-15 13:54 UTC · model grok-4.3

classification 💻 cs.DB
keywords: index tuning · LLM · Database Tuning Advisor · DTA · query optimization · execution time · performance tuning · SQL Server

The pith

LLM-driven index tuning can find configurations that significantly outperform DTA in execution time for a considerable number of cases, though DTA is generally more reliable.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates large language model approaches to recommending database indexes against Microsoft's Database Tuning Advisor (DTA) using both standard industrial benchmarks and real enterprise customer workloads. It measures success by actual query execution time after applying the suggested indexes. The results show DTA produces more consistent recommendations overall, but LLMs can discover superior index sets in many cases after only a few attempts. LLM reasoning often mirrors intuitive human judgments about index choices, which could be extracted to strengthen traditional methods. Direct production use faces barriers from high performance variance, weak gains when fused with DTA, and expensive validation steps.

Core claim

Although DTA is generally more reliable, with a few invocations, LLM can identify configurations that significantly outperform those found by DTA in execution time in a considerable number of cases, highlighting its potential as a complementary technique. We also observe that LLM's reasoning captures human-intuitive insights that may be distilled to potentially improve DTA. However, adopting LLM-driven index tuning in production remains challenging due to its substantial performance variance, limited and often negative impact when directly integrated into DTA, and the high cost of performance validation.

What carries the argument

Comparison of index configurations from DTA's what-if API cost estimates versus LLM suggestions, validated through actual query execution times on benchmarks and customer workloads.
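
The comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation harness: the function name `summarize`, the 5% `margin` used to count a win, and all timings are hypothetical. It takes one measured execution time per workload for DTA and several for the LLM (one per invocation), then reports the best and worst LLM speedup and the fraction of workloads where the best LLM run beats DTA.

```python
from statistics import mean


def summarize(dta_times, llm_runs, margin=0.05):
    """Compare best-of-k LLM runs against DTA per workload.

    dta_times: {workload: execution time in seconds with DTA's indexes}
    llm_runs:  {workload: [execution time per LLM invocation]}
    margin:    hypothetical relative threshold a run must beat to count as a win
    """
    rows = {}
    for name, dta in dta_times.items():
        runs = llm_runs[name]
        best, worst = min(runs), max(runs)
        rows[name] = {
            "best_speedup": dta / best,    # > 1 means the best LLM run is faster
            "worst_speedup": dta / worst,  # < 1 means the worst LLM run is slower
            "llm_wins": best < dta * (1 - margin),
        }
    win_fraction = mean(1.0 if r["llm_wins"] else 0.0 for r in rows.values())
    return rows, win_fraction


# Hypothetical timings (seconds); three LLM invocations per workload.
dta = {"q1": 10.0, "q2": 4.0}
llm = {"q1": [6.0, 12.0, 9.0], "q2": [5.0, 4.5, 6.0]}

rows, win_fraction = summarize(dta, llm)
for name, row in rows.items():
    print(name, row)
print("fraction of workloads where the best LLM run beats DTA:", win_fraction)
```

Reporting the worst run next to the best one keeps the variance visible, and the win fraction is the kind of exact count that would replace a phrase such as "a considerable number of cases".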

Load-bearing premise

The tested benchmarks and real-world customer workloads are representative, and measured execution time improvements accurately reflect production benefits without unaccounted confounding factors like hardware variation.

What would settle it

Repeating the experiments on a fresh collection of customer workloads and finding that LLM fails to produce outperforming configurations in a comparable fraction of cases would falsify the claim.

Figures

Figures reproduced from arXiv: 2603.09181 by Surajit Chaudhuri, Vivek Narasayya, Wentao Wu, Xiaoying Wang.

Figure 1. Single-query workload prompt template. From Section 2.1 (Benchmark and Customer Workloads): Unlike existing work that has been primarily evaluated using public benchmarks with synthetic data and queries, we focus on real-world customer workloads in our study …
Figure 2. Prompt plan example (TPC-H q04, simplified).
Figure 3. Multi-query workload prompt template. Remark. Our goal is not to identify an optimal prompt, but to provide LLM with basic information that a human expert would need to make informed index recommendations. In this study, we focus on evaluating the fundamental capability of LLM for index tuning, given only rudimentary information. More variations of LLM-driven index tuning are discussed in Section 7. …
Figure 4. LLM-driven index tuning (best) vs. DTA for tuning single-query workloads (marker denotes the worst run of LLM).
Figure 5. Comparison of index usage between LLM-driven index tuning and DTA for tuning single-query workloads.
Figure 6. Comparison of the estimated costs between LLM …
Figure 7. Examples of GPT-5’s reasoning process. In this section, we investigate this question in detail. We start by examining the underlying reasoning processes of GPT-5 for tuning single-query workloads. We observe that GPT-5’s reasoning follows several intuitive insights that align with human judgment and can be summarized into a few rules of thumb. Based on this observation, we explore whether these insights ca…
Figure 8. Evaluation of the simple index tuner (y-axis: …
Figure 9. LLM-driven index tuning (five invocations shown as bars) vs. DTA (red line) for tuning multi-query workloads.
Figure 10. Analysis of Real-D 𝐾 = 10. LLM’s indexes occupy only 7.7 GB. This shows that LLM does not achieve its performance by inflating the storage, indicating that its recommendation is indeed superior in this case. Per-query Improvement. Beyond total execution time, we also evaluated the benefit of the index recommendations at the per-query level …
Figure 12. Time breakdown for performance validation.
Original abstract

Index tuning is critical for the performance of modern database systems. Industrial index tuners, such as the Database Tuning Advisor (DTA) developed for Microsoft SQL Server, rely on the "what-if" API provided by the query optimizer to estimate the cost of a query given an index configuration, which can lead to suboptimal recommendations when the estimations are inaccurate. Large language model (LLM) offers a new approach to index tuning, with knowledge learned from web-scale training datasets. However, the effectiveness of LLM-driven index tuning, especially beyond what is already achieved by commercial index tuners, remains unclear. In this paper, we study the practical effectiveness of LLM-driven index tuning using both industrial benchmarks and real-world enterprise customer workloads, and compare it with DTA. Our results show that although DTA is generally more reliable, with a few invocations, LLM can identify configurations that significantly outperform those found by DTA in execution time in a considerable number of cases, highlighting its potential as a complementary technique. We also observe that LLM's reasoning captures human-intuitive insights that may be distilled to potentially improve DTA. However, adopting LLM-driven index tuning in production remains challenging due to its substantial performance variance, limited and often negative impact when directly integrated into DTA, and the high cost of performance validation. This work provides motivation, lessons, and practical insights that will inspire future work on LLM-driven index tuning both in academia and industry.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper evaluates the practical effectiveness of LLM-driven index tuning compared to Microsoft's Database Tuning Advisor (DTA) using industrial benchmarks and real-world enterprise customer workloads. It claims that while DTA is generally more reliable, LLMs can with a few invocations identify index configurations that significantly outperform DTA in execution time in a considerable number of cases; LLM reasoning also captures human-intuitive insights that could improve DTA, though production adoption is challenged by performance variance, limited integration benefits, and high validation costs.

Significance. If the results hold under rigorous controls, the work is significant for providing empirical evidence on LLM-based index tuning as a potential complement to commercial tools like DTA. The inclusion of real-world customer workloads is a notable strength, offering practical lessons on variance and adoption barriers that could guide hybrid tuning systems in database research.

major comments (2)
  1. §4 (Experimental Evaluation): The comparative results on execution-time outperformance lack any description of measurement protocols, including whether runs were repeated, cold-cache conditions were enforced, statistical significance was tested, error bars reported, or controls applied for buffer-pool state, concurrent load, and hardware variation. This directly undermines the central claim that LLM configurations significantly outperform DTA, as unaccounted noise could explain the reported wins.
  2. §5 (Discussion of Insights): The assertion that LLM reasoning captures human-intuitive insights lacks concrete examples from the workloads or a quantified analysis of how these could be distilled to improve DTA, rendering the complementarity argument anecdotal rather than evidence-based.
minor comments (1)
  1. Abstract: The term 'considerable number of cases' is imprecise; stating the exact fraction or count of workloads where LLM outperforms DTA would strengthen clarity without altering the narrative.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important areas for improving the clarity and rigor of our experimental and discussion sections. We address each major comment below and will revise the manuscript to incorporate the suggested enhancements.

Point-by-point responses
  1. Referee: §4 (Experimental Evaluation): The comparative results on execution-time outperformance lack any description of measurement protocols, including whether runs were repeated, cold-cache conditions were enforced, statistical significance was tested, error bars reported, or controls applied for buffer-pool state, concurrent load, and hardware variation. This directly undermines the central claim that LLM configurations significantly outperform DTA, as unaccounted noise could explain the reported wins.

    Authors: We agree that the original manuscript did not provide sufficient detail on the measurement protocols, which is a valid concern for validating the execution-time comparisons. In the revised version, we will expand the experimental setup subsection in §4 to explicitly describe: (1) each configuration was executed five times with the buffer pool flushed between runs to enforce cold-cache conditions; (2) statistical significance was evaluated using paired t-tests (p < 0.05); (3) error bars in figures represent one standard deviation; and (4) controls included dedicated hardware with no concurrent workloads and fixed server configurations to minimize variation. These additions will strengthen the reliability of the outperformance claims without altering the reported results. revision: yes

  2. Referee: §5 (Discussion of Insights): The assertion that LLM reasoning captures human-intuitive insights lacks concrete examples from the workloads or a quantified analysis of how these could be distilled to improve DTA, rendering the complementarity argument anecdotal rather than evidence-based.

    Authors: We acknowledge that the discussion in §5 would benefit from greater specificity. In the revision, we will include concrete examples drawn from the enterprise customer workloads, such as cases where the LLM recommended covering indexes for multi-column join patterns that aligned with common DBA practices but were not selected by DTA's cost model. However, a full quantified analysis of distilling these insights into modifications for DTA would require new experiments and implementation work that extends beyond the current study's scope; we will explicitly note this limitation and position it as an avenue for future research to make the complementarity argument more rigorous. revision: partial
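
The measurement protocol the simulated authors describe in response 1 (five runs per configuration, a paired t-test at p < 0.05, error bars of one standard deviation) can be sketched as follows. This is a hedged illustration using only the Python standard library: the timings are hypothetical, the helper names `error_bar` and `paired_t` are invented for this sketch, and pairing the two configurations by run index is an assumption, since the rebuttal does not state the pairing unit. Instead of computing a p-value, the sketch compares the t statistic against 2.776, the two-sided 5% critical value of Student's t with 4 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 5% critical value of Student's t for 4 degrees of freedom (5 pairs).
T_CRIT_DF4 = 2.776


def error_bar(times):
    """Return (mean, one sample standard deviation) for repeated runs."""
    return mean(times), stdev(times)


def paired_t(a, b):
    """Return the paired t statistic and its degrees of freedom."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs))), len(diffs) - 1


# Hypothetical cold-cache execution times in seconds, paired by run index.
dta_runs = [10.2, 9.8, 10.5, 10.1, 9.9]
llm_runs = [8.1, 8.4, 7.9, 8.3, 8.0]

t_stat, df = paired_t(dta_runs, llm_runs)
# The hard-coded critical value is only valid for df == 4.
significant = df == 4 and abs(t_stat) > T_CRIT_DF4

print("DTA mean, sd:", error_bar(dta_runs))
print("LLM mean, sd:", error_bar(llm_runs))
print("t =", round(t_stat, 2), "df =", df, "significant at 5%:", significant)
```

With only five runs per configuration the test has few degrees of freedom, so a statistics library that returns an exact p-value would be a reasonable replacement for the fixed critical value when the number of runs varies.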

Circularity Check

0 steps flagged

No circularity: empirical evaluation relies on external benchmarks and workloads

Full rationale

This paper is a pure empirical study comparing LLM index tuning against DTA on industrial benchmarks and real customer workloads. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain. All claims rest on direct experimental measurements against external data sources, with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation paper with no mathematical derivations, fitted parameters, or new postulated entities.
