arXiv: 2604.25325 · v1 · submitted 2026-04-28 · 💻 cs.SE · cs.AI · cs.CL

R³-SQL: Ranking Reward and Resampling for Text-to-SQL

Hojae Han, Yeonseok Jeong, Seung-won Hwang, Zhewei Yao, Yuxiong He

Pith reviewed 2026-05-07 16:14 UTC · model grok-4.3

keywords: text-to-sql · sql generation · candidate ranking · resampling · execution accuracy · large language models · natural language interfaces · query synthesis

The pith

R³-SQL groups Text-to-SQL candidates by execution results for consistent ranking and adds agentic resampling to recover missing correct queries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes R³-SQL to fix two problems in modern Text-to-SQL systems that generate multiple candidate queries and then rank them. Existing rankers often assign different scores to queries that produce identical results when executed, and they cannot recover when the correct query is never generated in the first pool. R³-SQL groups candidates by their execution outcomes so that equivalent queries receive the same rank, then scores each group with a reward that blends pairwise comparisons between groups and a pointwise measure of group quality. It also adds an agentic judge that decides whether to resample new candidates when the correct query is probably absent. These steps produce higher execution accuracy on five standard benchmarks.

Core claim

R³-SQL first groups candidates by execution result and ranks groups for consistency. To score each group, it combines a pairwise preference across groups with a pointwise utility from the best group rank and size, capturing relative preference, consistency, and candidate quality. To improve candidate recall, R³-SQL introduces agentic resampling, which judges the generated candidate pool and selectively resamples when the correct SQL is likely absent.
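The grouping step can be illustrated with a minimal sketch, assuming a SQLite database and treating two candidates as equivalent when they return the same multiset of rows. The helper names `execution_key` and `group_by_execution`, the toy `emp` table, and the choice to ignore row order are assumptions made for illustration; they are not taken from the paper.

```python
import sqlite3
from collections import defaultdict


def execution_key(conn, sql):
    """Return a hashable summary of what `sql` returns, or an error marker."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return ("error", type(exc).__name__)
    # Assumption: row order is ignored, so sort rows before comparing.
    return ("ok", tuple(sorted(rows, key=repr)))


def group_by_execution(conn, candidates):
    """Map each distinct execution result to the candidates that produce it."""
    groups = defaultdict(list)
    for sql in candidates:
        groups[execution_key(conn, sql)].append(sql)
    return dict(groups)


# Toy database and candidate pool (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE emp(name TEXT, dept TEXT, salary INT);"
    "INSERT INTO emp VALUES ('a','x',10),('b','x',20),('c','y',30);"
)
candidates = [
    "SELECT name FROM emp WHERE salary > 15",
    "SELECT name FROM emp WHERE NOT salary <= 15",  # same rows as the first
    "SELECT name FROM emp WHERE dept = 'x'",
    "SELECT nme FROM emp",                          # fails: unknown column
]
groups = group_by_execution(conn, candidates)
for key, members in groups.items():
    print(key[0], len(members))
```

Because every member of a group shares one key, any score attached to the group is automatically identical for all functionally equivalent candidates in it. Ignoring row order would be wrong for questions that ask for a specific ordering, so a real system would likely need an order-sensitive mode.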

What carries the argument

Execution-result grouping that forces consistent ranks across functionally equivalent queries, paired with a unified reward that merges pairwise group preferences and pointwise utility from rank and size, plus an agentic resampling judge that detects and corrects missing correct candidates.
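As a rough sketch of how a pairwise preference and a pointwise utility might be merged into one group score, the code below adds a pairwise win rate to a utility built from the group's best rank and its size. The additive form, the 0.5 weights, the `Group` fields, and the stand-in judge `toy_prefers` are all hypothetical; this page does not reproduce the paper's actual reward formula.

```python
from dataclasses import dataclass


@dataclass
class Group:
    name: str
    size: int       # number of candidates sharing this execution result
    best_rank: int  # best (lowest) pointwise rank of any member, 1 = top


def pairwise_win_rate(group, groups, prefers):
    """Fraction of the other groups that `prefers` places below `group`."""
    others = [g for g in groups if g is not group]
    if not others:
        return 1.0
    return sum(prefers(group, other) for other in others) / len(others)


def pointwise_utility(group, pool_size):
    """Hypothetical utility: favor a high best rank and a large group."""
    return 0.5 * (1.0 / group.best_rank) + 0.5 * (group.size / pool_size)


def rank_groups(groups, prefers):
    """Return (score, name) pairs sorted from best to worst."""
    pool_size = sum(g.size for g in groups)
    scored = [
        (pairwise_win_rate(g, groups, prefers) + pointwise_utility(g, pool_size), g.name)
        for g in groups
    ]
    return sorted(scored, reverse=True)


def toy_prefers(a, b):
    """Stand-in for a learned pairwise ranker (assumption)."""
    return a.best_rank < b.best_rank


groups = [Group("A", 3, 2), Group("B", 2, 1), Group("C", 1, 4)]
ranking = rank_groups(groups, toy_prefers)
print([name for _, name in ranking])
```

In a real pipeline `prefers` would be a call to a pairwise reward model, and the relative weighting of the two terms would have to be tuned or learned.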

If this is right

Ranking becomes identical for all queries that produce the same execution result.
Correct SQL can be recovered even if it is absent from the first round of generations.
Execution accuracy rises to 75.03 on BIRD-dev, the highest reported for methods with disclosed model sizes.
Gains appear consistently across five Text-to-SQL benchmarks.
The framework works by improving the ranking and selection stage without requiring larger base models.
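The recovery behavior described in the list above can be sketched as a simple control loop: generate a pool, ask a judge whether a correct query is likely present, and draw more candidates if not. The function `select_with_resampling`, its parameters, the choice to extend rather than replace the pool, and the toy callables are assumptions for illustration only.

```python
def select_with_resampling(question, generate, judge, rank, n=4, max_rounds=2):
    """Generate a pool and resample while the judge believes no candidate is correct.

    generate(question, n) -> list of SQL strings
    judge(question, pool) -> True if a correct query is likely in the pool
    rank(question, pool)  -> the selected SQL string
    """
    pool = list(generate(question, n))
    rounds = 0
    while rounds < max_rounds and not judge(question, pool):
        # Assumption: new samples are appended to the existing pool.
        pool.extend(generate(question, n))
        rounds += 1
    return rank(question, pool), rounds


# Toy stand-ins for the generator, judge, and ranker (hypothetical).
batches = iter([["SELECT 1", "SELECT 2"], ["SELECT 3", "SELECT 42"]])


def toy_generate(question, n):
    return next(batches)


def toy_judge(question, pool):
    return "SELECT 42" in pool


def toy_rank(question, pool):
    return pool[-1]


best, rounds = select_with_resampling("q", toy_generate, toy_judge, toy_rank, n=2)
print(best, rounds)
```

The `max_rounds` cap matters in practice: a judge that is too pessimistic would otherwise trigger unbounded generation cost.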

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Execution-based grouping for ranking could be tested on other structured generation tasks such as code synthesis or query reformulation where functional equivalence matters more than surface form.
The agentic resampling judge offers a template for adding self-correction loops in broader LLM pipelines that rely on sampling multiple outputs.
If execution feedback is noisy or platform-dependent, the grouping signal may require additional safeguards such as multiple environments or result normalization.
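One possible safeguard of the kind mentioned in the last point is to normalize result sets before comparing them. The sketch below rounds floating-point values and optionally ignores row order; the function name `normalize_result`, the `ordered` flag, and the six-digit rounding are illustrative assumptions rather than anything specified by the paper.

```python
from decimal import Decimal


def normalize_result(rows, ordered=False, ndigits=6):
    """Return a hashable, comparison-friendly form of a query result.

    rows:    iterable of row tuples as returned by a DB driver
    ordered: keep row order when the question requires a specific ordering
    ndigits: decimal places kept for float-like values
    """
    def norm_value(value):
        if isinstance(value, (float, Decimal)):
            return round(float(value), ndigits)
        return value

    normalized = [tuple(norm_value(v) for v in row) for row in rows]
    if not ordered:
        # Sort by repr so rows of mixed types can still be ordered.
        normalized.sort(key=repr)
    return tuple(normalized)


first = normalize_result([(1, 0.1 + 0.2), ("a", None)])
second = normalize_result([("a", None), (1, 0.3)])
print(first == second)
```

With `ordered=True` the same two inputs would compare as different, which is the behavior one would want for queries that include a meaningful `ORDER BY`.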

Load-bearing premise

Grouping candidates by execution result supplies a reliable and consistent ranking signal and the agentic resampling judge can accurately detect when the correct SQL is absent from the initial pool.

What would settle it

A controlled experiment showing that accuracy gains disappear when the resampling step is turned off on benchmarks where the initial pool often lacks the correct SQL, or when many queries with identical execution results turn out to be functionally incorrect.
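To make the proposed experiment concrete, the sketch below computes two quantities for a run: pool recall (how often a correct query is present at all) and selection accuracy (how often the selected query is correct). Comparing them with resampling on and off separates recall gains from ranking gains. The record layout and all numbers are made-up toy values, not results from the paper.

```python
def pool_recall(records):
    """Share of questions whose candidate pool contains a correct query."""
    return sum(any(r["pool_correct"]) for r in records) / len(records)


def selection_accuracy(records):
    """Share of questions where the selected candidate is correct."""
    return sum(r["pool_correct"][r["selected"]] for r in records) / len(records)


# Toy runs (hypothetical): one correctness flag per candidate, plus the chosen index.
without_resampling = [
    {"pool_correct": [False, True], "selected": 1},
    {"pool_correct": [False, False], "selected": 0},
    {"pool_correct": [True, False], "selected": 1},
    {"pool_correct": [False, False], "selected": 0},
]
with_resampling = [
    {"pool_correct": [False, True], "selected": 1},
    {"pool_correct": [False, False, True], "selected": 2},
    {"pool_correct": [True, False], "selected": 1},
    {"pool_correct": [False, False, False], "selected": 0},
]

for label, run in [("off", without_resampling), ("on", with_resampling)]:
    print(label, pool_recall(run), selection_accuracy(run))
```

The gap between recall and accuracy within a single run is the headroom left for the ranker; the change in recall between the two runs is what resampling contributes.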

Figures

Figures reproduced from arXiv: 2604.25325 by Hojae Han, Seung-won Hwang, Yeonseok Jeong, Yuxiong He, Zhewei Yao.

**Figure 1.** Comparison of ranking strategies for six SQL candidates, where …

**Figure 2.** Groupwise ranking by R³-SQL. … decision is incorrect, 1 if it is correct, and 1.5 if it is correct and remains consistent across both input orders. …

**Figure 3.** EX stability across 4 random seeds on BIRD …

**Figure 4.** Execution accuracy (EX) of R³-SQL on BIRD-dev across the threshold τ in Eq. (2). Each point ranks n = 32 SQL candidates per query generated by Arctic-Text2SQL-R1-32B with nucleus sampling (T = 0.8). … mean score only when the pairwise ranker strongly indicates that the group consists of incorrect candidates; otherwise the group is assigned the default score of 1. This confidence-gated strategy reduces bias fr…

**Figure 5.** Real BIRD-dev case studies illustrating functional inconsistency in pointwise ranking, sorted by pointwise …

**Figure 6.** Prompt for the Initial Pool Generation stage, where the LLM generates SQL candidates with chain-of…

**Figure 7.** Prompt template for pairwise reward model comparing two SQL candidates.

**Figure 8.** Full system prompt for the LLM agent f for resampling decision.

**Figure 9.** Full input prompt for the LLM agent f for resampling decision, populated with the natural language question, schema, and candidate SQL pool.

Original abstract

Modern Text-to-SQL systems generate multiple candidate SQL queries and rank them to judge a final prediction. However, existing methods face two limitations. First, they often score functionally equivalent SQL queries inconsistently despite identical execution results. Second, ranking cannot recover when the correct SQL is absent from the candidate pool. We propose R$^3$-SQL, a Text-to-SQL framework that addresses both issues through unified reward for ranking and resampling. R$^3$-SQL first groups candidates by execution result and ranks groups for consistency. To score each group, it combines a pairwise preference across groups with a pointwise utility from the best group rank and size, capturing relative preference, consistency, and candidate quality. To improve candidate recall, R$^3$-SQL introduces agentic resampling, which judges the generated candidate pool and selectively resamples when the correct SQL is likely absent. R$^3$-SQL achieves 75.03 execution accuracy on BIRD-dev, a new state of the art among methods using models with disclosed sizes, with consistent gains across five benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

R³-SQL groups SQL candidates by execution result for consistent ranking and adds agentic resampling for better recall, delivering reported SOTA gains on BIRD-dev but resting on an unproven correlation between execution equivalence and ranking quality.

The main thing to know is that R³-SQL groups candidate SQL queries by their execution results to avoid scoring functionally equivalent queries inconsistently, then ranks those groups with a combined pairwise preference and pointwise utility score before using an agentic judge to decide when to resample for missing correct queries. This produces the claimed 75.03 execution accuracy on BIRD-dev and gains across five benchmarks among methods with disclosed model sizes. The framework is presented as a unified approach rather than separate fixes for ranking and recall.

The paper does a solid job identifying the two limitations in prior Text-to-SQL work and showing how the group-based reward captures relative preference, consistency, and candidate quality in one mechanism. The resampling step directly targets the recall problem when the initial pool lacks the right SQL. Reporting consistent improvements on multiple benchmarks gives a practical sense of where the method helps.

The soft spots center on the grouping step itself. Execution equivalence can hide semantic differences, and when multiple groups produce similar utilities the ranking signal may not reliably surface the best option. The stress-test concern lands because both the reward and the resampling decision depend on this grouping being a stable proxy for quality. The paper would benefit from more targeted ablations on cases where execution matches but semantics differ, plus separate checks on the judge's accuracy. The experiments support the headline numbers, but the absence of deeper failure analysis leaves the generality of the gains open.

This work is for researchers and practitioners building LLM-based Text-to-SQL systems for data interfaces or agents. A reader focused on ranking and resampling techniques in code generation would extract usable ideas from the reward design.

I would send it for peer review because the claims are concrete, the framework is clearly motivated, and the results are strong enough to merit referee scrutiny even with the validation gaps.

Referee Report

3 major / 2 minor

Summary. The paper claims to introduce R³-SQL, which groups Text-to-SQL candidate queries by execution results to rank them consistently using a unified reward combining pairwise group preferences and pointwise utilities from best rank and group size. It also uses an agentic resampling judge to detect when the correct SQL is missing from the pool and regenerate candidates. This leads to a reported 75.03 execution accuracy on BIRD-dev, a new SOTA among disclosed-size models, with gains on five benchmarks.

Significance. Should the central claims hold under scrutiny, the significance is moderate to high for the Text-to-SQL community. By tackling inconsistent scoring of equivalent queries and pool recall issues through execution-based grouping and agentic judgment, it offers a novel way to boost performance without larger models. The SOTA result on BIRD-dev and consistent gains suggest practical impact, though the strength depends on the robustness of the experimental validation.

major comments (3)

[Method (grouping and ranking)] The unified reward for ranking groups by execution result is load-bearing for the consistency claim and the 75.03 accuracy. However, the manuscript does not address cases where multiple groups have similar utilities or execution equivalence masks semantic differences, which could lead to unreliable ranking signals.
[Agentic resampling] The agentic resampling judge's ability to accurately detect absence of correct SQL is critical for the recall improvement and overall gains. The paper provides no error analysis or accuracy metrics for this judge, leaving the weakest assumption untested.
[Experiments] Table reporting BIRD-dev results: The SOTA claim lacks detailed comparisons, ablations on reward vs resampling components, and error analysis. This makes it hard to confirm the gains are due to the proposed method.

minor comments (2)

The abstract mentions 'consistent gains across five benchmarks' but does not specify which benchmarks or the magnitude of gains, which would aid clarity.
[Notation] The definitions of pairwise preference and pointwise utility could be formalized with equations for better reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our paper R³-SQL. We address each of the major comments below and outline the revisions we will make to strengthen the manuscript.

Referee: [Method (grouping and ranking)] The unified reward for ranking groups by execution result is load-bearing for the consistency claim and the 75.03 accuracy. However, the manuscript does not address cases where multiple groups have similar utilities or execution equivalence masks semantic differences, which could lead to unreliable ranking signals.

Authors: The unified reward combines pairwise group preferences with pointwise utilities from best rank and group size to ensure consistent scoring of execution-equivalent queries. While the original manuscript did not include explicit analysis of near-tie utilities or masked semantic differences, execution equivalence is the appropriate criterion given the evaluation protocol. We will add a discussion subsection on these edge cases, including utility distribution statistics and tie-breaking behavior, as a partial revision.

Revision: partial

Referee: [Agentic resampling] The agentic resampling judge's ability to accurately detect absence of correct SQL is critical for the recall improvement and overall gains. The paper provides no error analysis or accuracy metrics for this judge, leaving the weakest assumption untested.

Authors: We agree that error analysis for the resampling judge is needed to substantiate its role. The original submission focused on end-to-end gains. In the revision we will add a new subsection with judge accuracy metrics (precision/recall on detecting missing correct SQL) and a small-scale manual verification of its decisions on BIRD-dev samples.

Revision: yes

Referee: [Experiments] Table reporting BIRD-dev results: The SOTA claim lacks detailed comparisons, ablations on reward vs resampling components, and error analysis. This makes it hard to confirm the gains are due to the proposed method.

Authors: We acknowledge the table could be more informative. The current results show overall performance, but we will expand the experiments section with: (i) additional baseline comparisons, (ii) ablations isolating the unified reward ranking from the resampling component, and (iii) error analysis of failure modes. These changes will better attribute the 75.03% accuracy and support the SOTA claim.

Revision: yes
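The judge accuracy metrics discussed in the second response could be computed along the following lines, treating "correct SQL is absent from the pool" as the positive class. The function name `judge_metrics` and the toy labels are hypothetical; ground-truth absence would in practice come from executing each candidate against the gold query.

```python
def judge_metrics(predicted_absent, actually_absent):
    """Precision and recall of the judge's 'correct SQL is absent' decisions."""
    pairs = list(zip(predicted_absent, actually_absent))
    tp = sum(p and a for p, a in pairs)        # resampled, and it was needed
    fp = sum(p and not a for p, a in pairs)    # resampled unnecessarily
    fn = sum(a and not p for p, a in pairs)    # missed a pool with no correct SQL
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy labels for five questions (made-up values, not from the paper).
predicted = [True, True, False, False, True]
actual = [True, False, True, False, True]
precision, recall = judge_metrics(predicted, actual)
print(round(precision, 3), round(recall, 3))
```

Low precision here translates into wasted generation budget, while low recall means the ranker is left choosing among candidates that are all wrong.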

Circularity Check

0 steps flagged

No circularity: reward and resampling defined from external execution signals

The framework groups candidates by execution result and scores groups via combined pairwise preference plus pointwise utility (best-group rank and size). These quantities are computed directly from observable execution equivalence and candidate statistics rather than fitted parameters or self-referential equations. The agentic resampling judge is introduced as a separate decision procedure without reducing to prior fitted values or author-only uniqueness theorems. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear in the derivation. The reported accuracy gains rest on empirical benchmarks, not tautological construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the high-level framework name; the central claim rests on the unstated assumption that execution equivalence is a sufficient proxy for functional correctness and that the resampling judge generalizes.

pith-pipeline@v0.9.0 · 5497 in / 1149 out tokens · 60218 ms · 2026-05-07T16:14:32.725195+00:00 · methodology
