FD-NL2SQL: Feedback-Driven Clinical NL2SQL that Improves with Use
Pith reviewed 2026-05-10 09:30 UTC · model grok-4.3
The pith
FD-NL2SQL improves its translation of natural-language questions into executable SQL for oncology trial databases by adding clinician edits and automatically generated variants to its exemplar bank.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FD-NL2SQL is a feedback-driven clinical NL2SQL assistant. It decomposes a natural-language question into predicate-level sub-questions, retrieves semantically similar expert-verified exemplars via sentence embeddings, and synthesizes executable SQL conditioned on the decomposition, the retrieved exemplars, and the schema. It then improves with use through two update signals: clinician edits that are approved and added to the exemplar bank, and single atomic logic-based mutations of valid SQL, retained only when they return non-empty results, for which a second LLM generates the corresponding natural-language questions and predicate decompositions.
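The retrieval step in this pipeline can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the bag-of-words vector stands in for the paper's sentence-embedding model, and the exemplar bank, questions, and column names are invented.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words vector; stands in for the paper's sentence embeddings.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, exemplar_bank, k=2):
    # Rank expert-verified exemplars by similarity to the new question.
    q = embed(question)
    ranked = sorted(exemplar_bank,
                    key=lambda ex: cosine(q, embed(ex["nl"])),
                    reverse=True)
    return ranked[:k]

# Hypothetical exemplar bank; SQL and column names are illustrative only.
bank = [
    {"nl": "list trials with PD-L1 biomarker",
     "sql": 'SELECT "NCT" FROM trials WHERE "Biomarker" = \'PD-L1\';'},
    {"nl": "trials started after 2020",
     "sql": 'SELECT "NCT" FROM trials WHERE "StartYear" > 2020;'},
]
hits = retrieve("which trials test the PD-L1 biomarker", bank, k=1)
```

In the described system, the retrieved exemplars would then be placed in the synthesis prompt alongside the schema and the predicate decomposition; both LLM calls are out of scope for this sketch.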
What carries the argument
Two update signals that expand the exemplar bank: direct addition of clinician-approved SQL edits, and single atomic SQL mutations that are kept only when they return non-empty results and are paired with a second LLM's natural-language questions and decompositions.
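The two signals admit a compact sketch. Everything below is illustrative, not the paper's code: the mutation rule and the SQL-to-NL step are stubs standing in for the paper's atomic-mutation operators and second LLM, and the toy SQLite table is invented.

```python
import sqlite3

def returns_rows(conn, sql):
    # Keep a variant only if it executes and yields a non-empty result set.
    try:
        return len(conn.execute(sql).fetchall()) > 0
    except sqlite3.Error:
        return False

def add_clinician_edit(bank, question, approved_sql):
    # Signal (i): an approved clinician edit becomes a new exemplar directly.
    bank.append({"nl": question, "sql": approved_sql})

def add_mutated_variant(bank, conn, base_sql, mutate, sql_to_nl):
    # Signal (ii): one atomic mutation, the non-empty-result filter, then a
    # second model (stubbed here) writes the paired NL question.
    variant = mutate(base_sql)
    if variant != base_sql and returns_rows(conn, variant):
        bank.append({"nl": sql_to_nl(variant), "sql": variant})
        return True
    return False

# Hypothetical single-column table for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trials (phase INTEGER)")
conn.executemany("INSERT INTO trials VALUES (?)", [(1,), (2,), (3,)])

bank = []
accepted = add_mutated_variant(
    bank, conn,
    "SELECT phase FROM trials WHERE phase >= 2;",
    mutate=lambda s: s.replace(">=", "<="),            # one operator flip
    sql_to_nl=lambda s: "trials with phase at most 2", # stands in for the second LLM
)
```

The filter is deliberately weak, which is exactly the reviewer's concern below: a variant can pass the non-empty check while its generated NL question is misleading.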
If this is right
- The exemplar bank grows with every approved edit and every accepted mutation, increasing the chance of retrieving relevant past examples for new questions.
- New training examples are created without extra human labeling because the mutation process and second LLM handle the natural-language side automatically.
- Clinicians receive an interactive view of decomposition, retrieval, synthesis, and execution that lets them refine outputs and feed those refinements back into the system.
- Only SQL variants that return non-empty results and pass post-processing validity checks are retained, limiting the injection of broken examples.
- Over repeated use the system shifts from relying solely on the initial schema-aware LLM toward greater use of domain-specific exemplars.
Where Pith is reading between the lines
- The same two-signal pattern could be tested on NL2SQL tasks in other data-rich but annotation-poor domains such as clinical trials outside oncology or hospital billing records.
- Measuring the growth rate of the exemplar bank and the change in top-k retrieval accuracy after a fixed number of sessions would give a concrete test of whether the feedback loop is delivering compounding value.
- If the mutation step is kept deliberately small, the approach may scale to larger databases where exhaustive search for new examples would be costly.
- The design implicitly treats the second LLM as a cheap data-augmentation engine rather than a primary reasoner, which could be compared against simply storing raw SQL variants without the generated natural-language side.
Load-bearing premise
That the second LLM will produce accurate and semantically useful natural-language questions and predicate decompositions for the mutated SQL variants so they actually help future retrieval and synthesis.
What would settle it
After a sequence of real clinician sessions, measure whether retrieval precision or end-to-end SQL correctness on new questions rises or falls; if the added exemplars produce no net gain or cause more retrieval errors, the improvement claim is falsified.
Figures
Original abstract
Clinicians exploring oncology trial repositories often need ad-hoc, multi-constraint queries over biomarkers, endpoints, interventions, and time, yet writing SQL requires schema expertise. We demo FD-NL2SQL, a feedback-driven clinical NL2SQL assistant for SQLite-based oncology databases. Given a natural-language question, a schema-aware LLM decomposes it into predicate-level sub-questions, retrieves semantically similar expert-verified NL2SQL exemplars via sentence embeddings, and synthesizes executable SQL conditioned on the decomposition, retrieved exemplars, and schema, with post-processing validity checks. To improve with use, FD-NL2SQL incorporates two update signals: (i) clinician edits of generated SQL are approved and added to the exemplar bank; and (ii) lightweight logic-based SQL augmentation applies a single atomic mutation (e.g., operator or column change), retaining variants only if they return non-empty results. A second LLM generates the corresponding natural-language question and predicate decomposition for accepted variants, automatically expanding the exemplar bank without additional annotation. The demo interface exposes decomposition, retrieval, synthesis, and execution results to support interactive refinement and continuous improvement.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper describes FD-NL2SQL, a feedback-driven NL2SQL assistant for SQLite-based oncology trial databases. Given a natural-language question, a schema-aware LLM decomposes it into predicate-level sub-questions, retrieves semantically similar expert-verified exemplars via sentence embeddings, synthesizes executable SQL conditioned on the decomposition/retrieved exemplars/schema, and applies post-processing validity checks. The system is designed to improve with use via two update signals: (i) approved clinician edits of generated SQL are added to the exemplar bank, and (ii) lightweight logic-based SQL augmentation applies a single atomic mutation (e.g., operator or column change), retains variants only if they return non-empty results, and uses a second LLM to generate the corresponding natural-language question and predicate decomposition for accepted variants, thereby expanding the exemplar bank without additional annotation.
Significance. If the auto-augmented exemplars generated by the second LLM prove accurate and semantically aligned with the mutated SQL, the approach could enable scalable, annotation-free improvement in clinical NL2SQL systems by combining retrieval-augmented generation with lightweight feedback loops. The architecture is coherent and integrates standard components (LLM decomposition, embedding-based retrieval, validity filtering) in a manner that directly addresses the annotation bottleneck for domain-specific querying.
major comments (1)
- [Abstract] The headline claim that FD-NL2SQL 'improves with use' through the lightweight logic-based SQL augmentation mechanism rests entirely on the untested assumption that the second LLM will produce accurate natural-language questions and predicate decompositions for the mutated SQL variants. No quantitative evaluation (e.g., before/after retrieval or synthesis accuracy on a held-out set of clinical queries), error analysis, or baseline comparison is provided to show that the auto-augmented exemplars are net-positive rather than noise; non-empty result filtering alone does not establish semantic usefulness for future retrieval or synthesis steps.
minor comments (1)
- [Abstract] The description of the two update signals is packed into a single dense sentence; separating the clinician-edit pathway from the automatic augmentation pathway into distinct sentences or bullets would improve clarity for readers.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the system's architecture and its relevance to the annotation bottleneck in clinical NL2SQL. We address the single major comment below.
Point-by-point responses
- Referee: [Abstract] The headline claim that FD-NL2SQL 'improves with use' through the lightweight logic-based SQL augmentation mechanism rests entirely on the untested assumption that the second LLM will produce accurate natural-language questions and predicate decompositions for the mutated SQL variants. No quantitative evaluation (e.g., before/after retrieval or synthesis accuracy on a held-out set of clinical queries), error analysis, or baseline comparison is provided to show that the auto-augmented exemplars are net-positive rather than noise; non-empty result filtering alone does not establish semantic usefulness for future retrieval or synthesis steps.
Authors: We agree that the manuscript provides no quantitative evaluation of the auto-augmented exemplars' impact on downstream retrieval or synthesis accuracy, nor any error analysis or baseline comparison. As a system demonstration paper, the primary evidence for improvement with use is the direct clinician-edit path (approved SQL added to the exemplar bank). The logic-based augmentation path is presented as a lightweight, annotation-free mechanism that applies single atomic mutations, retains only non-empty-result variants, and uses a second LLM to generate aligned NL questions and predicate decompositions. We acknowledge that non-empty filtering alone does not guarantee semantic usefulness for future retrieval and that the quality of the LLM-generated NL/decomposition pairs remains an assumption without empirical validation in the current version. In revision we will (a) revise the abstract to distinguish the demonstrated clinician-feedback improvement from the proposed augmentation mechanism, (b) add a dedicated subsection with qualitative examples of mutation + LLM-generated NL pairs, and (c) include an explicit limitations paragraph stating that quantitative before/after evaluation on held-out clinical queries is planned future work.
Revision: partial
Circularity Check
No circularity: engineering system description with no derivations or self-referential reductions
Full rationale
The paper describes an applied NL2SQL demo system whose core mechanisms (LLM decomposition, embedding-based retrieval, SQL synthesis, clinician feedback, and logic-based SQL mutation with non-empty filtering plus secondary LLM labeling) are presented as external calls and simple rules. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. No self-citations are invoked as load-bearing premises, and the feedback loop is not shown to reduce to its own inputs by construction. The architecture is therefore self-contained as an engineering artifact rather than a closed mathematical derivation.
Axiom & Free-Parameter Ledger
axioms (3)
- domain assumption Schema-aware LLMs can reliably decompose natural-language clinical questions into predicate-level sub-questions
- domain assumption Sentence embeddings retrieve semantically similar expert-verified NL2SQL exemplars that improve synthesis quality
- domain assumption A second LLM can generate accurate natural-language questions and predicate decompositions for logic-mutated SQL variants