pith. machine review for the scientific record.

arxiv: 2604.15646 · v1 · submitted 2026-04-17 · 💻 cs.CL

Recognition: unknown

FD-NL2SQL: Feedback-Driven Clinical NL2SQL that Improves with Use

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 09:30 UTC · model grok-4.3

classification 💻 cs.CL
keywords NL2SQL · clinical databases · feedback-driven learning · oncology trials · SQL augmentation · exemplar retrieval · LLM decomposition · natural language interfaces

The pith

FD-NL2SQL improves its translation of natural-language questions into executable SQL for oncology trial databases by adding clinician edits and automatically generated variants to its exemplar bank.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a system that lets clinicians ask ad-hoc questions about biomarkers, endpoints, and interventions in SQLite oncology databases without writing SQL themselves. It breaks each question into predicate-level sub-questions, pulls in similar verified examples, and produces SQL with validity checks. Two built-in update signals then expand the set of examples: clinicians can approve or edit the output SQL for direct addition, and a lightweight process mutates valid SQL by one atomic change, keeps only those that return data, and uses a second model to write matching natural-language questions and decompositions. If these signals work, the system steadily becomes more accurate on the exact queries clinicians actually need, without requiring new manual annotation each time.

Core claim

FD-NL2SQL is a feedback-driven clinical NL2SQL assistant that decomposes a natural-language question into predicate-level sub-questions, retrieves semantically similar expert-verified exemplars via sentence embeddings, and synthesizes executable SQL conditioned on the decomposition, exemplars, and schema. It then improves with use through two update signals: approved clinician edits added to the exemplar bank, and logic-based single atomic mutations on valid SQL whose non-empty results trigger a second LLM to generate corresponding natural-language questions and predicate decompositions.
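The claimed pipeline can be sketched as a control-flow skeleton. Everything below is an illustrative reconstruction, not the paper's code: the LLM decomposition, embedding retrieval, and SQL synthesis are replaced by trivial stand-ins (keyword splitting, word overlap, exemplar reuse), and all function names are hypothetical.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class Exemplar:
    question: str
    sql: str

def decompose(question):
    # Stand-in for the schema-aware LLM: split the question on "and"
    # into predicate-level sub-questions.
    return [p.strip() for p in question.lower().split(" and ")]

def retrieve(predicates, bank, k=2):
    # Stand-in for sentence-embedding retrieval: rank exemplars by
    # word overlap with the decomposed predicates.
    words = set(" ".join(predicates).split())
    def overlap(ex):
        return len(words & set(ex.question.lower().split()))
    return sorted(bank, key=overlap, reverse=True)[:k]

def synthesize(question, predicates, exemplars):
    # Stand-in for the synthesis LLM: reuse the closest exemplar's SQL.
    return exemplars[0].sql if exemplars else ""

def validity_check(sql, conn):
    # Post-processing check: must be a SELECT and must execute.
    if not sql.lstrip().upper().startswith("SELECT"):
        return False
    try:
        conn.execute(sql)
        return True
    except sqlite3.Error:
        return False

def answer(question, bank, conn):
    predicates = decompose(question)
    exemplars = retrieve(predicates, bank)
    sql = synthesize(question, predicates, exemplars)
    return sql if validity_check(sql, conn) else None
```

In the real system each stand-in is an LLM or embedding call; only the control flow mirrors the paper's description.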

What carries the argument

Two update signals that expand the exemplar bank: direct addition of clinician-approved SQL edits plus single atomic SQL mutations filtered by non-empty results and paired with second-LLM natural-language questions and decompositions.

If this is right

  • The exemplar bank grows with every approved edit and every accepted mutation, increasing the chance of retrieving relevant past examples for new questions.
  • New training examples are created without extra human labeling because the mutation process and second LLM handle the natural-language side automatically.
  • Clinicians receive an interactive view of decomposition, retrieval, synthesis, and execution that lets them refine outputs and feed those refinements back into the system.
  • Only SQL variants that return non-empty results and pass post-processing validity checks are retained, limiting the injection of broken examples.
  • Over repeated use the system shifts from relying solely on the initial schema-aware LLM toward greater use of domain-specific exemplars.
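The mutate-and-filter signal in the list above can be made concrete. A minimal sketch, assuming SQLite; the mutation table and helper names are hypothetical, and in the paper's design each surviving variant would still need the second LLM to supply a question and decomposition before joining the exemplar bank.

```python
import sqlite3

# One atomic change per variant: swap exactly one operator occurrence.
ATOMIC_SWAPS = [(" >= ", " <= "), (" = ", " != "), (" AND ", " OR ")]

def mutate_once(sql):
    """Yield variants differing from `sql` by a single atomic operator change."""
    for old, new in ATOMIC_SWAPS:
        idx = sql.find(old)
        if idx != -1:
            yield sql[:idx] + new + sql[idx + len(old):]

def keep_nonempty(variants, conn):
    """Retain only variants that execute and return at least one row,
    mirroring the paper's non-empty-result filter."""
    kept = []
    for v in variants:
        try:
            if conn.execute(v).fetchone() is not None:
                kept.append(v)
        except sqlite3.Error:
            pass
    return kept
```

The non-empty filter discards variants that are syntactically valid but vacuous on the current data, which is the mechanism the review later questions: it screens for executability, not semantic usefulness.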

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-signal pattern could be tested on NL2SQL tasks in other data-rich but annotation-poor domains such as clinical trials outside oncology or hospital billing records.
  • Measuring the growth rate of the exemplar bank and the change in top-k retrieval accuracy after a fixed number of sessions would give a concrete test of whether the feedback loop is delivering compounding value.
  • If the mutation step is kept deliberately small, the approach may scale to larger databases where exhaustive search for new examples would be costly.
  • The design implicitly treats the second LLM as a cheap data-augmentation engine rather than a primary reasoner, which could be compared against simply storing raw SQL variants without the generated natural-language side.
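The retrieval-accuracy probe suggested above is straightforward to operationalize. A sketch, substituting a toy bag-of-words cosine for the paper's sentence embeddings; all names are hypothetical.

```python
import math
from collections import Counter

def vec(text):
    # Toy bag-of-words vector; the paper uses sentence embeddings instead.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k(question, bank, k=3):
    q = vec(question)
    return sorted(bank, key=lambda ex: cosine(q, vec(ex["question"])), reverse=True)[:k]

def hit_rate(eval_set, bank, k=3):
    """Fraction of held-out questions whose gold SQL appears among the
    top-k retrieved exemplars: a proxy for compounding retrieval value."""
    hits = sum(
        any(ex["sql"] == gold for ex in top_k(q, bank, k))
        for q, gold in eval_set
    )
    return hits / len(eval_set)
```

Running `hit_rate` against a frozen evaluation set before and after a batch of clinician sessions would show whether the growing exemplar bank actually lifts retrieval.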

Load-bearing premise

That the second LLM will produce accurate and semantically useful natural-language questions and predicate decompositions for the mutated SQL variants so they actually help future retrieval and synthesis.

What would settle it

After a sequence of real clinician sessions, measure whether retrieval precision or end-to-end SQL correctness on new questions rises or falls; if the added exemplars produce no net gain or cause more retrieval errors, the improvement claim is falsified.
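The end-to-end half of that test amounts to an execution-match metric. A minimal sketch, assuming SQLite and treating two queries as equivalent when they return the same rows (an ordinary NL2SQL proxy, not a metric the paper specifies):

```python
import sqlite3

def results_match(conn, predicted_sql, gold_sql):
    """True when both queries execute and return the same multiset of rows."""
    try:
        pred = sorted(conn.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        return False
    gold = sorted(conn.execute(gold_sql).fetchall())
    return pred == gold

def execution_accuracy(conn, pairs):
    """pairs: (predicted_sql, gold_sql) tuples from held-out clinician questions."""
    return sum(results_match(conn, p, g) for p, g in pairs) / len(pairs)
```

Comparing this score on the same held-out questions before and after exemplar additions would directly confirm or falsify the improvement claim.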

Figures

Figures reproduced from arXiv: 2604.15646 by Irbaz Bin Riaz, Kaneez Zahra Rubab Khakwani, Manan Roy Choudhury, Mohamad Bassam Sonbol, Muhammad Ali Khan, Suparno Roy Chowdhury, Tejas Anvekar, Vivek Gupta.

Figure 1
Figure 1: A clinician question is decomposed into schema-aligned predicate sub-questions; for each predicate, semantically similar expert-approved exemplars are retrieved; this guides the schema-grounded SQL synthesizer. Users can edit and approve the final SQL to update the exemplar bank. To expand coverage with minimal annotation, approved SQL is augmented by a single atomic mutation (e.g., operator or column substi…
Figure 2
Figure 2: FD-NL2SQL demo UI and feedback loop. Clinicians issue a natural-language query in the chat interface (right) and view executed results in the table view (left). The system shows the generated SQL, which an expert can accept, modify, or reject; accepted/edited queries are saved back to the exemplar bank to improve future retrieval and synthesis (autofill supports rapid refinement). similar question-SQL ex…
read the original abstract

Clinicians exploring oncology trial repositories often need ad-hoc, multi-constraint queries over biomarkers, endpoints, interventions, and time, yet writing SQL requires schema expertise. We demo FD-NL2SQL, a feedback-driven clinical NL2SQL assistant for SQLite-based oncology databases. Given a natural-language question, a schema-aware LLM decomposes it into predicate-level sub-questions, retrieves semantically similar expert-verified NL2SQL exemplars via sentence embeddings, and synthesizes executable SQL conditioned on the decomposition, retrieved exemplars, and schema, with post-processing validity checks. To improve with use, FD-NL2SQL incorporates two update signals: (i) clinician edits of generated SQL are approved and added to the exemplar bank; and (ii) lightweight logic-based SQL augmentation applies a single atomic mutation (e.g., operator or column change), retaining variants only if they return non-empty results. A second LLM generates the corresponding natural-language question and predicate decomposition for accepted variants, automatically expanding the exemplar bank without additional annotation. The demo interface exposes decomposition, retrieval, synthesis, and execution results to support interactive refinement and continuous improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper describes FD-NL2SQL, a feedback-driven NL2SQL assistant for SQLite-based oncology trial databases. Given a natural-language question, a schema-aware LLM decomposes it into predicate-level sub-questions, retrieves semantically similar expert-verified exemplars via sentence embeddings, synthesizes executable SQL conditioned on the decomposition/retrieved exemplars/schema, and applies post-processing validity checks. The system is designed to improve with use via two update signals: (i) approved clinician edits of generated SQL are added to the exemplar bank, and (ii) lightweight logic-based SQL augmentation applies a single atomic mutation (e.g., operator or column change), retains variants only if they return non-empty results, and uses a second LLM to generate the corresponding natural-language question and predicate decomposition for accepted variants, thereby expanding the exemplar bank without additional annotation.

Significance. If the auto-augmented exemplars generated by the second LLM prove accurate and semantically aligned with the mutated SQL, the approach could enable scalable, annotation-free improvement in clinical NL2SQL systems by combining retrieval-augmented generation with lightweight feedback loops. The architecture is coherent and integrates standard components (LLM decomposition, embedding-based retrieval, validity filtering) in a manner that directly addresses the annotation bottleneck for domain-specific querying.

major comments (1)
  1. [Abstract] Abstract: The headline claim that FD-NL2SQL 'improves with use' through the lightweight logic-based SQL augmentation mechanism rests entirely on the untested assumption that the second LLM will produce accurate natural-language questions and predicate decompositions for the mutated SQL variants. No quantitative evaluation (e.g., before/after retrieval or synthesis accuracy on a held-out set of clinical queries), error analysis, or baseline comparison is provided to show that the auto-augmented exemplars are net-positive rather than noise; non-empty result filtering alone does not establish semantic usefulness for future retrieval or synthesis steps.
minor comments (1)
  1. [Abstract] Abstract: The description of the two update signals is packed into a single dense sentence; separating the clinician-edit pathway from the automatic augmentation pathway into distinct sentences or bullets would improve clarity for readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the system's architecture and its relevance to the annotation bottleneck in clinical NL2SQL. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claim that FD-NL2SQL 'improves with use' through the lightweight logic-based SQL augmentation mechanism rests entirely on the untested assumption that the second LLM will produce accurate natural-language questions and predicate decompositions for the mutated SQL variants. No quantitative evaluation (e.g., before/after retrieval or synthesis accuracy on a held-out set of clinical queries), error analysis, or baseline comparison is provided to show that the auto-augmented exemplars are net-positive rather than noise; non-empty result filtering alone does not establish semantic usefulness for future retrieval or synthesis steps.

    Authors: We agree that the manuscript provides no quantitative evaluation of the auto-augmented exemplars' impact on downstream retrieval or synthesis accuracy, nor any error analysis or baseline comparison. As a system demonstration paper, the primary evidence for improvement with use is the direct clinician-edit path (approved SQL added to the exemplar bank). The logic-based augmentation path is presented as a lightweight, annotation-free mechanism that applies single atomic mutations, retains only non-empty-result variants, and uses a second LLM to generate aligned NL questions and predicate decompositions. We acknowledge that non-empty filtering alone does not guarantee semantic usefulness for future retrieval and that the quality of the LLM-generated NL/decomposition pairs remains an assumption without empirical validation in the current version. In revision we will (a) revise the abstract to distinguish the demonstrated clinician-feedback improvement from the proposed augmentation mechanism, (b) add a dedicated subsection with qualitative examples of mutation + LLM-generated NL pairs, and (c) include an explicit limitations paragraph stating that quantitative before/after evaluation on held-out clinical queries is planned future work. revision: partial

Circularity Check

0 steps flagged

No circularity: engineering system description with no derivations or self-referential reductions

full rationale

The paper describes an applied NL2SQL demo system whose core mechanisms (LLM decomposition, embedding-based retrieval, SQL synthesis, clinician feedback, and logic-based SQL mutation with non-empty filtering plus secondary LLM labeling) are presented as external calls and simple rules. No equations, fitted parameters, predictions, or uniqueness theorems appear in the provided text. No self-citations are invoked as load-bearing premises, and the feedback loop is not shown to reduce to its own inputs by construction. The architecture is therefore self-contained as an engineering artifact rather than a closed mathematical derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The central claim rests on assumptions about LLM reliability for decomposition and variant generation rather than new mathematical constructs or fitted parameters.

axioms (3)
  • domain assumption Schema-aware LLMs can reliably decompose natural-language clinical questions into predicate-level sub-questions
    Invoked in the initial decomposition step that conditions SQL synthesis
  • domain assumption Sentence embeddings retrieve semantically similar expert-verified NL2SQL exemplars that improve synthesis quality
    Central to the retrieval component
  • domain assumption A second LLM can generate accurate natural-language questions and predicate decompositions for logic-mutated SQL variants
    Required for automatic exemplar expansion without human annotation

pith-pipeline@v0.9.0 · 5535 in / 1598 out tokens · 44102 ms · 2026-05-10T09:30:55.477941+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

13 extracted references · 3 canonical work pages · 3 internal anchors

  1. [1] OpenAI GPT-5 System Card
  2. [2] Qwen3 Technical Report
  3. [3] Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning