pith. machine review for the scientific record.

arxiv: 2604.06804 · v1 · submitted 2026-04-08 · 💻 cs.DB


LASER: A Data-Centric Method for Low-Cost and Efficient SQL Rewriting based on SQL-GRPO


Pith reviewed 2026-05-10 17:10 UTC · model grok-4.3

classification 💻 cs.DB
keywords SQL rewriting · query optimization · small language models · MCTS data generation · GRPO · database systems · SQL-GRPO · zero-shot transfer

The pith

LASER trains small language models on MCTS-generated slow queries using SQL-GRPO to rewrite SQL for better execution efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Query rewriting improves database performance by converting slow queries into faster equivalents without changing their results. Existing rule-based tools are inflexible, while large language models are costly to run and raise privacy concerns. LASER addresses both problems: it first builds a dataset of challenging slow queries through Monte Carlo tree search that mixes rewrite rules and model mutations to create realistic performance issues, then uses a tailored policy-optimization method to teach small models to spot and apply latency-reducing transformations. The result is compact models whose rewrites execute faster than those produced by either traditional methods or large models, transfer to new scenarios without extra training, and add little inference overhead.
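The two-stage pipeline described above can be sketched in miniature. This is an illustrative reading, not the paper's implementation: the mutation rules, the LLM stand-in, and the cost proxy are all invented placeholders (a real system would mutate a parsed AST, e.g. via sqlglot, and verify by actually executing the query).

```python
import random

# Illustrative anti-pattern mutations: each rewrites a query string into a
# semantically equivalent but typically slower form. These are toy stand-ins
# for the paper's rule-guided anti-patterns.
RULE_MUTATIONS = [
    # Wrap the query in a needless subquery.
    lambda q: f"SELECT * FROM ({q}) AS sub",
    # Append a redundant sort (a no-op here if one already exists).
    lambda q: q if "ORDER BY" in q else q + " ORDER BY 1",
]

def mutate(query, use_llm=False):
    """One expansion step: a rule-based move or a stubbed 'LLM' mutation."""
    if use_llm:
        # Stand-in for an LLM-proposed structural mutation.
        return f"SELECT * FROM ({query}) AS llm_wrapped"
    return random.choice(RULE_MUTATIONS)(query)

def grow_slow_query(seed, steps=3, runtime=len):
    """Greedy stand-in for the MCTS search: keep a mutation only if the
    measured cost increases. Here `runtime` is a toy proxy (string length);
    the paper's pipeline uses real execution-based verification."""
    best, best_cost = seed, runtime(seed)
    for step in range(steps):
        cand = mutate(best, use_llm=(step % 2 == 1))  # alternate rule / LLM moves
        if runtime(cand) > best_cost:                 # keep only verified regressions
            best, best_cost = cand, runtime(cand)
    return best
```

The actual search is a tree (MCTS with selection and backpropagation) rather than this greedy chain, but the accept-only-if-slower filter is the load-bearing step either way.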

Core claim

By constructing SQL-MCTS, a large-scale corpus of complex slow queries evolved from seeds using rule-guided anti-patterns and LLM mutations, and applying SQL-GRPO with anchored group advantage and complexity-adaptive dynamic rollouts, small models can autonomously learn execution-verified rewriting patterns that deliver superior efficiency and robust zero-shot transferability.

What carries the argument

SQL-GRPO, an adaptation of Group Relative Policy Optimization that integrates Anchored Group Advantage for refined advantage estimation and Complexity-Adaptive Dynamic Rollout for efficient exploration, used to teach small models latency-aware rewriting.
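The review does not spell out the Anchored Group Advantage formula. One plausible reading, sketched here with invented details, is that each group of sampled rewrites is scored relative to a baseline that also folds in an anchor reward (e.g., the original unrewritten query's reward), so a group of uniformly bad rewrites cannot all receive positive advantage:

```python
from statistics import mean, pstdev

def anchored_group_advantage(rewards, anchor_reward, eps=1e-8):
    """GRPO-style group-relative advantages with an anchor sample.

    Hypothetical reading of 'Anchored Group Advantage': the anchor
    (assumed to be the original query's reward) is included when computing
    the group baseline, but receives no advantage of its own.
    """
    group = rewards + [anchor_reward]
    mu, sigma = mean(group), pstdev(group)
    # Standard GRPO normalization, but against the anchored baseline.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under this reading, if every sampled rewrite is slower than the original, the anchor drags the group mean up and all rollouts get negative advantage, which plain within-group normalization would not guarantee.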

If this is right

  • Compact models outperform rule-based systems and LLMs on execution efficiency for rewritten queries.
  • Zero-shot transferability reduces the need for domain-specific retraining when facing new query workloads.
  • Minimal inference overhead makes the approach suitable for production database environments.
  • Data generation through hybrid MCTS expansion provides a scalable way to create training examples without manual annotation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integrating such models into query optimizers could automate performance tuning in databases used by non-experts.
  • Extending the MCTS data synthesis to other optimization problems like index selection or join ordering might yield similar gains.
  • Lower model size could enable on-device or edge database query optimization in resource-constrained settings.

Load-bearing premise

The MCTS-generated synthetic slow queries capture the variety of performance bottlenecks present in real database workloads sufficiently well for the learned rewriting rules to apply broadly.

What would settle it

Evaluating the LASER-trained model on a collection of actual production SQL queries from diverse database systems and comparing the resulting execution times against unoptimized and baseline-rewritten versions would test if the claimed improvements hold.
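The minimal shape of that settling experiment is a latency harness with repeated runs; everything below is a generic sketch with placeholder query runners, not the paper's evaluation code:

```python
import time

def time_query(run_fn, repeats=5):
    """Median wall-clock latency of a query runner over several repeats,
    to damp caching and scheduling noise. `run_fn` is assumed to be a
    zero-argument callable that executes one query to completion."""
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        run_fn()
        samples.append(time.perf_counter() - t0)
    return sorted(samples)[len(samples) // 2]

def speedup(original_fn, rewritten_fn):
    """Latency ratio; values above 1 mean the rewrite is faster."""
    return time_query(original_fn) / time_query(rewritten_fn)
```

A real comparison would also need warm/cold cache control and result-set equivalence checks before any latency number counts, which is exactly the validation the referee report asks for.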

Figures

Figures reproduced from arXiv: 2604.06804 by Jiahui Li, Rong Kang, Tieying Zhang, Tongwang Wu, Yunjun Gao, Yuren Mao.

Figure 1
Figure 1. An overview of the LASER framework. view at source ↗
Figure 2
Figure 2. The workflow of MCTS-Driven Slow Query Generation. view at source ↗
Figure 3
Figure 3. Impact of Rollout Budget on Optimization Discov… view at source ↗
Figure 4
Figure 4. Comparison of Advantage Estimation. view at source ↗
Figure 5
Figure 5. An example of generated slow queries. view at source ↗
Original abstract

Query rewriting, the process of transforming queries into semantically equivalent yet more efficient variants, is crucial for database optimization. Existing solutions predominantly rely on either rule-based heuristics or Large Language Models (LLMs). However, traditional rule-based methods lack adaptability, while LLM-based approaches incur prohibitive inference costs and privacy risks. In contrast, Small Language Models (SLMs) present a compelling middle ground, potentially offering both flexibility and efficiency. However, the development of such compact models is severely bottlenecked by the scarcity of high-quality, domain-specific training data. To bridge this gap, we introduce LASER, a data-centric framework designed to empower small models for robust SQL optimization. First, to address the scarcity of existing benchmarks and the limited optimization headroom of generic synthetic queries, we construct SQL-MCTS, a large-scale corpus of complex slow queries. We employ an MCTS-based hybrid expansion strategy that combines rule-guided anti-patterns with LLM mutations to evolve structurally expressive seeds into execution-verified slow variants. Second, to enable the model to autonomously discover latency-aware rewriting patterns, we propose SQL-GRPO, a specialized alignment strategy adapted from Group Relative Policy Optimization. By integrating Anchored Group Advantage to refine advantage estimation and Complexity-Adaptive Dynamic Rollout to efficiently allocate exploration budgets, this approach effectively empowers compact models to master execution-based optimization logic. Implemented on Qwen3 models, LASER significantly outperforms rule-based systems and LLMs in execution efficiency, while exhibiting robust zero-shot transferability with minimal overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LASER, a data-centric framework for low-cost SQL query rewriting with small language models. It first builds SQL-MCTS, a large corpus of complex slow queries, via a hybrid MCTS expansion strategy that combines rule-guided anti-patterns with LLM mutations on seed queries followed by execution verification. It then applies SQL-GRPO, an adaptation of Group Relative Policy Optimization that incorporates Anchored Group Advantage for refined advantage estimation and Complexity-Adaptive Dynamic Rollout for efficient exploration, to train compact models (Qwen3) to discover latency-aware rewriting patterns. The central claim is that LASER significantly outperforms both rule-based systems and LLMs in execution efficiency while showing robust zero-shot transferability with minimal overhead.

Significance. If the experimental results and generalization claims hold, LASER would offer a practical middle ground between rigid rule-based optimizers and high-cost LLM-based rewriters, directly addressing data scarcity for domain-specific SLM training in databases. The explicit use of execution-verified synthetic data generation and the two GRPO adaptations (Anchored Group Advantage, Complexity-Adaptive Dynamic Rollout) constitute concrete, reusable contributions to applying RL-style alignment to query optimization.

major comments (2)
  1. [Abstract] The assertions that LASER 'significantly outperforms rule-based systems and LLMs in execution efficiency' and shows 'robust zero-shot transferability' are presented without numerical results, baseline names, latency-reduction percentages, success rates, dataset sizes, or statistical tests, making it impossible to judge the magnitude or reliability of the central performance claims.
  2. [SQL-MCTS construction] The hybrid MCTS strategy (rule-guided anti-patterns + LLM mutations + execution verification) is load-bearing for the zero-shot transfer claims, yet the manuscript supplies no analysis comparing the generated query distribution (structural mutations, latency profiles) against real production slow-query logs or external benchmarks such as TPC-DS or industry traces; without such validation, the risk that learned patterns exploit generation artifacts rather than general optimization logic remains unaddressed.
minor comments (2)
  1. [Throughout] Ensure that all acronyms (SLM, MCTS, GRPO, SQL-GRPO) are defined at first use and used consistently in equations and figure captions.
  2. [Experimental figures] Figure captions for any latency or success-rate plots should explicitly state the number of queries, number of runs, and error bars or statistical tests used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights opportunities to strengthen the clarity and validation of our claims. We address each major comment below and outline the corresponding revisions.

Point-by-point responses
  1. Referee: [Abstract] The assertions that LASER 'significantly outperforms rule-based systems and LLMs in execution efficiency' and shows 'robust zero-shot transferability' are presented without numerical results, baseline names, latency-reduction percentages, success rates, dataset sizes, or statistical tests, making it impossible to judge the magnitude or reliability of the central performance claims.

    Authors: We agree that the abstract would be more informative with explicit quantitative support for the performance claims. In the revised manuscript we will expand the abstract to report key metrics from our experiments, including average latency reductions relative to rule-based baselines and LLMs, success rates on the evaluated workloads, dataset sizes used for training and testing, and references to the specific benchmarks (e.g., TPC-DS variants). This change will make the magnitude and reliability of the results immediately apparent to readers. revision: yes

  2. Referee: [SQL-MCTS construction] The hybrid MCTS strategy (rule-guided anti-patterns + LLM mutations + execution verification) is load-bearing for the zero-shot transfer claims, yet the manuscript supplies no analysis comparing the generated query distribution (structural mutations, latency profiles) against real production slow-query logs or external benchmarks such as TPC-DS or industry traces; without such validation, the risk that learned patterns exploit generation artifacts rather than general optimization logic remains unaddressed.

    Authors: We acknowledge the value of distributional validation for the SQL-MCTS corpus. Our current evaluation already demonstrates zero-shot transfer on standard benchmarks including TPC-DS, and the execution-verification step ensures only genuinely slow queries are retained. However, the manuscript does not include an explicit side-by-side comparison of structural features or latency profiles against TPC-DS or external traces. We will add a dedicated analysis subsection that quantifies these aspects (e.g., query complexity distributions, anti-pattern coverage, and latency histograms) relative to TPC-DS and any available public query logs, thereby addressing the concern about potential generation artifacts. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline is self-contained

full rationale

The paper's core contributions are the SQL-MCTS data generation procedure (MCTS hybrid expansion + execution verification) and the SQL-GRPO alignment method (Anchored Group Advantage + Complexity-Adaptive Dynamic Rollout, adapted from published GRPO). Neither reduces by construction to its inputs: synthetic queries are generated and then filtered by runtime measurement, while the policy is trained and evaluated on held-out splits with external baselines. No equations equate a claimed performance gain to a fitted constant, no uniqueness theorem is imported from the authors' prior work, and no ansatz is smuggled via self-citation. The zero-shot transfer claims rest on experimental results rather than definitional equivalence, making the derivation chain non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that synthetic queries generated by the hybrid MCTS process capture the optimization opportunities present in real workloads and that latency feedback during RL training produces generalizable rewriting rules.

axioms (2)
  • domain assumption MCTS hybrid expansion with rule-guided anti-patterns and LLM mutations produces structurally diverse, execution-verified slow queries that are useful for training.
    Invoked in the construction of SQL-MCTS corpus.
  • domain assumption Execution latency provides a reliable, non-noisy reward signal for policy optimization in SQL rewriting.
    Central to SQL-GRPO training loop.
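The second axiom can be made concrete as a reward function. The equivalence gate and normalization below are assumptions about how such a reward would plausibly be shaped, not the paper's stated definition:

```python
def rewrite_reward(orig_latency, new_latency, results_match, floor=1e-6):
    """Latency-based reward for a candidate rewrite.

    Assumed shape: reward is the relative speedup of the rewrite, hard-gated
    on result equivalence so that a rewrite which changes the answer is
    penalized regardless of how fast it runs. `floor` guards against
    division by near-zero latencies.
    """
    if not results_match:
        return -1.0
    return (orig_latency - new_latency) / max(orig_latency, floor)
```

Note the axiom's "non-noisy" qualifier is doing real work here: measured latencies fluctuate run to run, so in practice the inputs to a function like this would need to be medians over repeated executions for the reward signal to be trainable.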

pith-pipeline@v0.9.0 · 5591 in / 1379 out tokens · 48015 ms · 2026-05-10T17:10:49.967998+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 19 canonical work pages · 7 internal anchors

  1. [1]

    [n.d.]. AutoDL. https://www.autodl.com/

  2. [2]

    [n.d.]. Common Table Expressions. https://www.postgresql.org/docs/current/queries-with.html

  3. [3]

    [n.d.]. SQLGlot. https://github.com/tobymao/sqlglot

  4. [4]

    [n.d.]. VolcEngine. https://console.volcengine.com

  5. [5]

    Arash Ahmadian, Chris Cremer, Matthias Gallé, Marzieh Fadaee, Julia Kreutzer, Olivier Pietquin, Ahmet Üstün, and Sara Hooker. 2024. Back to Basics: Revisiting REINFORCE-Style Optimization for Learning from Human Feedback in LLMs. ACL, 12248–12267

  6. [6]

    Udbhav Bamba, Minghao Fang, Yifan Yu, Haizhong Zheng, and Fan Lai. 2025. XRPO: Pushing the limits of GRPO with Targeted Exploration and Exploitation. arXiv preprint arXiv:2510.06672 (2025)

  7. [7]

    Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. 1983. Neuron-like adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics SMC-13, 5 (1983), 834–846

  8. [8]

    Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde, Michael J. Mior, and Daniel Lemire. 2018. Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources. SIGMOD, 221–230

  9. [9]

    Cameron B. Browne, Edward Powley, Daniel Whitehouse, Simon M. Lucas, Peter I. Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. 2012. A Survey of Monte Carlo Tree Search Methods. IEEE Transactions on Computational Intelligence and AI in Games 4, 1 (2012), 1–43

  10. [10]

    Minghan Chen, Guikun Chen, Wenguan Wang, and Yi Yang. 2025. Seed-GRPO: Semantic entropy enhanced GRPO for uncertainty-aware policy optimization. arXiv preprint arXiv:2505.12346 (2025)

  11. [11]

    Xu Chen, Zhen Wang, Shuncheng Liu, Yaliang Li, Kai Zeng, Bolin Ding, Jingren Zhou, Han Su, and Kai Zheng. 2023. Base: Bridging the gap between cost and latency for query optimization. Proc. VLDB Endow. 16, 8 (2023), 1958–1966

  12. [12]

    Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V. Le, Sergey Levine, and Yi Ma. 2025. SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training. arXiv preprint arXiv:2501.17161 (2025)

  13. [13]

    Bailu Ding, Surajit Chaudhuri, Johannes Gehrke, and Vivek Narasayya. 2021. DSB: A decision support benchmark for workload-driven and traditional database systems. Proc. VLDB Endow. 14, 13 (2021), 3376–3388

  14. [14]

    Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, et al. 2025. Agentic reinforced policy optimization. arXiv preprint arXiv:2507.19849 (2025)

  15. [15]

    Lishui Fan, Yu Zhang, Mouxiang Chen, and Zhongxin Liu. 2025. Posterior-GRPO: Rewarding reasoning processes in code generation. arXiv preprint arXiv:2508.05170 (2025)

  16. [16]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

  17. [17]

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556 (2022)

  18. [18]

    Levente Kocsis and Csaba Szepesvári. 2006. Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning. 282–293

  19. [19]

    Atharv Kulkarni and Vivek Srikumar. 2025. Reinforcing Code Generation: Improving Text-to-SQL with Execution-Based Learning. arXiv preprint arXiv:2506.06093 (2025)

  20. [20]

    Haoyang Li, Shang Wu, Xiaokang Zhang, Xinmei Huang, Jing Zhang, Fuxin Jiang, Shuai Wang, Tieying Zhang, Jianjun Chen, Rui Shi, Hong Chen, and Cuiping Li. 2025. OmniSQL: Synthesizing High-Quality Text-to-SQL Data at Scale. Proc. VLDB Endow. (2025), 4695–4709

  21. [21]

    Jiahui Li, Tongwang Wu, Yuren Mao, Yunjun Gao, Yajie Feng, and Huaizhong Liu. 2025.

  22. [22]

    SQL-Factory: A Multi-Agent Framework for High-Quality and Large-Scale SQL Generation. arXiv preprint arXiv:2504.14837 (2025)

  23. [23]

    Zhaodonghui Li, Haitao Yuan, Huiming Wang, Gao Cong, and Lidong Bing. 2024. LLM-R2: A Large Language Model Enhanced Rule-Based Rewrite System for Boosting Query Efficiency. Proc. VLDB Endow. 18, 1 (2024), 53–65

  24. [24]

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. Deepseek-V3 technical report. arXiv preprint arXiv:2412.19437 (2024)

  25. [25]

    Jie Liu and Barzan Mozafari. 2024. Query rewriting via large language models. arXiv preprint arXiv:2403.09060 (2024)

  26. [26]

    Ryan Marcus, Parimarjan Negi, Hongzi Mao, Chi Zhang, Mohammad Alizadeh, Tim Kraska, Olga Papaemmanouil, and Nesime Tatbul. 2019. Neo: a learned query optimizer. Proc. VLDB Endow. 12, 11 (2019), 1705–1718

  27. [27]

    Raghunath Othayoth Nambiar and Meikel Poess. 2006. The making of TPC-DS. In VLDB

  28. [28]

    OpenAI. 2024. Hello GPT-4o. https://openai.com/index/hello-gpt-4o/

  29. [29]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. NeurIPS (2022), 27730–27744

  30. [30]

    Meikel Poess and Chris Floyd. 2000. New TPC benchmarks for decision support and web commerce. ACM SIGMOD Record 29, 4 (2000), 64–71

  31. [31]

    Mohammadreza Pourreza, Shayan Talaei, Ruoxi Sun, Xingchen Wan, Hailong Li, Azalia Mirhoseini, Amin Saberi, Sercan Arik, et al. 2025. Reasoning-SQL: Reinforcement learning with SQL tailored partial rewards for reasoning-enhanced text-to-SQL. arXiv preprint arXiv:2503.23157 (2025)

  32. [32]

    Suming Qiu, Jing Li, Zhicheng Zhou, Junjie Huang, Linyuan Qiu, and Zhijie Sun. 2025.

  33. [33]

    HES-SQL: Hybrid Reasoning for Efficient Text-to-SQL with Structural Skeleton Guidance. arXiv preprint arXiv:2510.08896 (2025)

  34. [34]

    Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. [n.d.]. Improving language understanding by generative pre-training.

  35. [35]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems (2023), 53728–53741

  36. [36]

    Praveen Seshadri, Hamid Pirahesh, and TY Cliff Leung. 1996. Complex Query Decorrelation. In ICDE. 450–450

  37. [37]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

  38. [38]

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. 2025. Hybridflow: A flexible and efficient RLHF framework. In EuroSys. 1279–1297

  39. [39]

    Yuyang Song, Hanxu Yan, Jiale Lao, Yibo Wang, Yufei Li, Yuanchun Zhou, Jianguo Wang, and Mingjie Tang. 2025. QUITE: A Query Rewrite System Beyond Rules with LLM Agents. arXiv preprint arXiv:2506.07675 (2025)

  40. [40]

    Zhaoyan Sun, Xuanhe Zhou, Guoliang Li, Xiang Yu, Jianhua Feng, and Yong Zhang. 2025. R-Bot: An LLM-Based Query Rewrite System. Proc. VLDB Endow. 18, 12 (2025), 5031–5044

  41. [41]

    Zhaoguo Wang, Zhou Zhou, Yicun Yang, Haoran Ding, Gansen Hu, Ding Ding, Chuzhe Tang, Haibo Chen, and Jinyang Li. 2022. Wetune: Automatic discovery and verification of query rewrite rules. In SIGMOD. 94–107

  42. [42]

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. NeurIPS 35 (2022), 24824–24837

  43. [43]

    Dongjie Xu, Yue Cui, Weijie Shi, Qingzhi Ma, Hanghui Guo, Jiaming Li, Yao Zhao, Ruiyuan Zhang, Shimin Di, Jia Zhu, et al. 2025. E3-rewrite: Learning to rewrite SQL for executability, equivalence, and efficiency. arXiv preprint arXiv:2508.09023 (2025)

  44. [44]

    Cong Yan, Yin Lin, and Yeye He. 2023. Predicate pushdown for data science pipelines. SIGMOD 1, 2 (2023), 1–28

  45. [45]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  46. [46]

    Zongheng Yang, Wei-Lin Chiang, Sifei Luan, Gautam Mittal, Michael Luo, and Ion Stoica. 2022. Balsa: Learning a query optimizer without expert demonstrations. In SIGMOD. 931–944

  47. [47]

    Bohan Zhai, Canwen Xu, Yuxiong He, and Zhewei Yao. 2025. Optimizing Reasoning for Text-to-SQL with Execution Feedback. In ACL. 19206–19218

  48. [48]

    Yunjia Zhang, Yannis Chronis, Jignesh M Patel, and Theodoros Rekatsinas. 2023. Simple adaptive query processing vs. learned query optimizers: Observations and analysis. Proc. VLDB Endow. 16, 11 (2023), 2962–2975

  49. [49]

    Haizhong Zheng, Yang Zhou, Brian R Bartoldson, Bhavya Kailkhura, Fan Lai, Jiawei Zhao, and Beidi Chen. 2025. Act Only When It Pays: Efficient Reinforcement Learning for LLM Reasoning via Selective Rollouts. arXiv preprint arXiv:2506.02177 (2025)

  50. [50]

    Han Zhong, Zikang Shan, Guhao Feng, Wei Xiong, Xinle Cheng, Li Zhao, Di He, Jiang Bian, and Liwei Wang. 2024. DPO meets PPO: Reinforced token optimization for RLHF. arXiv preprint arXiv:2404.18922 (2024)

  51. [51]

    Xuanhe Zhou, Guoliang Li, Chengliang Chai, and Jianhua Feng. 2021. A learned query rewrite system using Monte Carlo tree search. Proc. VLDB Endow. 15, 1 (2021), 46–58