arxiv: 2606.04500 · v1 · pith:P2TFLVQGnew · submitted 2026-06-03 · 💻 cs.CL

SANE: Schema-aware Natural-language Evaluation of Biological Data

Pith reviewed 2026-06-28 06:32 UTC · model grok-4.3

classification 💻 cs.CL
keywords: text-to-SQL · large language models · schema-aware prompting · biological data · natural language interfaces · few-shot evaluation · database query generation

The pith

Few-shot LLMs generate accurate SQL for biological microscopy data using schema-aware prompting and guardrails without any training or fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SANE, a schema-grounded method that automatically builds text-to-SQL benchmarks from real high-throughput microscopy experiments. It tests a few-shot large language model on these benchmarks and finds that structured prompts plus guardrails reliably produce correct queries when the underlying schemas are constrained. Failures occur mainly when user inputs are ambiguous or underspecified: the model either asks for clarification too readily or answers a question that should first have been disambiguated, rather than emitting incorrect SQL. The results suggest that natural-language database access can work in well-defined scientific domains without custom model adaptation.

Core claim

SANE creates automatically generated, schema-grounded benchmarks tied to actual experimental structures in biological datasets. Evaluation on these benchmarks shows that a few-shot large language model, combined with structured prompting and guardrails, generates accurate queries under constrained schemas without model training or fine-tuning. Most observed failures arise from ambiguous or underspecified inputs and appear as overly cautious clarification requests, or as answers to queries that should first have been disambiguated, rather than as incorrect SQL output.
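The paper's actual prompt is not reproduced on this page, so the following is only a minimal sketch of what "structured, schema-aware prompting" could look like. The function names, the `CLARIFY:` convention, and the example table are all hypothetical and chosen for illustration.

```python
def render_schema(schema):
    """Render a {table: [(column, type), ...]} mapping as plain text lines."""
    lines = []
    for table, columns in schema.items():
        cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
        lines.append(f"TABLE {table} ({cols})")
    return "\n".join(lines)


def build_prompt(schema, examples, question):
    """Assemble a few-shot prompt: instructions, schema, worked examples, question.

    The instruction wording and the CLARIFY: marker are assumptions, not the
    paper's prompt.
    """
    parts = [
        "You translate questions into SQL for the schema below.",
        "Use only the listed tables and columns.",
        "If the question is ambiguous or underspecified, reply with "
        "CLARIFY: <what is missing>.",
        "",
        render_schema(schema),
        "",
    ]
    for example_question, example_sql in examples:
        parts += [f"Q: {example_question}", f"SQL: {example_sql}", ""]
    parts += [f"Q: {question}", "SQL:"]
    return "\n".join(parts)
```

Placing the schema ahead of the few-shot examples keeps every example interpretable against the same table and column names the model is allowed to use.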

What carries the argument

SANE (Schema-Aware Natural-language Evaluation), a paradigm that produces schema-grounded, automatically generated benchmarks tied to real experimental structures for reproducible text-to-SQL assessment.
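As a rough illustration of schema-grounded benchmark generation, the sketch below introspects a SQLite database and emits question/SQL pairs with ground-truth results. It is not the authors' generator: the `plate` and `well` tables, the single question template, and all function names are assumptions made for the example.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database standing in for experimental metadata."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE plate (plate_id INTEGER PRIMARY KEY, barcode TEXT);
        CREATE TABLE well (well_id INTEGER PRIMARY KEY, plate_id INTEGER,
                           compound TEXT, cell_count INTEGER);
        INSERT INTO plate VALUES (1, 'P001'), (2, 'P002');
        INSERT INTO well VALUES (1, 1, 'DMSO', 120), (2, 1, 'drugA', 80),
                                (3, 2, 'drugA', 95);
    """)
    return conn


def extract_schema(conn):
    """Return {table: [(column, declared_type), ...]} read from the database."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    schema = {}
    for table in tables:
        info = conn.execute(f'PRAGMA table_info("{table}")')
        schema[table] = [(row[1], row[2]) for row in info]
    return schema


def generate_benchmark(conn, schema):
    """Emit one counting question per distinct value of every TEXT column."""
    cases = []
    for table, columns in schema.items():
        for column, ctype in columns:
            if ctype.upper() != "TEXT":
                continue
            values = conn.execute(
                f'SELECT DISTINCT "{column}" FROM "{table}" ORDER BY 1')
            for (value,) in values.fetchall():
                sql = f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" = ?'
                cases.append({
                    "question": f"How many {table} rows have {column} "
                                f"equal to {value}?",
                    "sql": sql,
                    "params": (value,),
                    # Ground truth comes from executing the gold query.
                    "expected": conn.execute(sql, (value,)).fetchall(),
                })
    return cases
```

Because every question is derived from values that exist in the database, the ground-truth SQL and its result are known by construction, which is what makes this style of benchmark reproducible.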

If this is right

  • Natural-language interfaces can provide reliable access to structured biological datasets without training specialized models.
  • Guardrails tied to the schema reduce hallucinations in generated SQL queries.
  • Text-to-SQL evaluation becomes scalable and reproducible through automatic benchmark generation from experimental metadata.
  • Error patterns point to input disambiguation as the main remaining challenge rather than query generation accuracy.
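One way a schema-tied guardrail could work is sketched below, assuming a SQLite backend. The checks shown (single statement, read-only, compiles against the live schema) are illustrative choices, not the paper's guardrail set, and `check_sql` is a hypothetical helper.

```python
import sqlite3


def check_sql(conn, sql):
    """Return (ok, reason) for a candidate query without running it.

    Conservative by design: any semicolon inside the statement is rejected,
    even one inside a string literal.
    """
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt:
        return False, "multiple statements"
    if not stmt.lower().startswith("select"):
        return False, "only SELECT is allowed"
    try:
        # EXPLAIN compiles the statement, so unknown tables or columns
        # raise an error, but the query itself is not executed.
        conn.execute("EXPLAIN " + stmt)
    except sqlite3.Error as exc:
        return False, f"does not compile against schema: {exc}"
    return True, "ok"
```

A query that references a hallucinated column fails the compile step and can be turned into a retry or a clarification request instead of being run.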

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompting approach may extend to other domains with rigid schemas such as clinical trial data or genomic annotations.
  • Systems could incorporate automatic query reformulation to handle underspecified inputs before generating SQL.
  • Benchmark generation could be generalized to track changes in schema or data collection protocols over time.

Load-bearing premise

Biological data schemas are constrained and well-defined enough that schema-aware prompting and guardrails can prevent incorrect SQL even on real user questions.

What would settle it

Run the system on a set of real, ambiguous user questions about the microscopy dataset and check whether it produces any incorrect SQL statements instead of requesting clarification.

Figures

Figures reproduced from arXiv: 2606.04500 by Markus Reischl, Martin Krueger, Rolf Gattung.

Figure 1: Hierarchical dataset structure of our high…
Figure 3: SANE benchmark generation. The database is systematically queried to extract schema structure and experimental content. Test queries are generated with corresponding ground truth SQL. The LLM is evaluated by comparing predicted context classifications (sufficient vs. missing) and generated SQL execution results against ground truth.
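The comparison described in the Figure 3 caption can be sketched as a per-case scoring function. The outcome labels, the dictionary layout, and the order-insensitive row comparison are assumptions made for illustration; the paper may score cases differently.

```python
import sqlite3
from collections import Counter


def same_rows(expected, actual):
    """Compare two result sets as multisets, ignoring row order."""
    return Counter(expected) == Counter(actual)


def score_case(conn, gold, pred):
    """Label one benchmark case.

    gold and pred are dicts with "context" ("sufficient" or "missing")
    and "sql" (a query string, or None when clarification is requested).
    """
    if gold["context"] == "missing":
        if pred["context"] == "missing":
            return "correct_clarification"
        # The model answered a question it should have asked about.
        return "missed_clarification"
    if pred["context"] == "missing":
        # The question was answerable, but the model asked anyway.
        return "overcautious_clarification"
    if not pred.get("sql"):
        return "incorrect_sql"
    expected = conn.execute(gold["sql"]).fetchall()
    try:
        actual = conn.execute(pred["sql"]).fetchall()
    except sqlite3.Error:
        return "incorrect_sql"
    return "correct_sql" if same_rows(expected, actual) else "incorrect_sql"
```

Comparing execution results rather than SQL text means two differently written queries count as equal when they return the same rows.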
Original abstract

High-throughput microscopy generates large, structured datasets capturing cellular responses to pharmacological perturbations, but accessing these datasets typically requires SQL expertise. Large language models offer a natural-language alternative, yet their tendency to hallucinate raises concerns about result reliability. We present SANE (Schema-Aware Natural-language Evaluation), a novel paradigm for domain-specific text-to-SQL evaluation: schema-grounded, automatically generated benchmarks tied to real and specific experimental structure. SANE makes evaluation more scalable, systematic, and reproducible. Using SANE, we evaluate a few-shot large language model and show that, under constrained schemas with structured prompting and guardrails, accurate query generation is achievable without any model training or fine-tuning. Most failures stem from ambiguous or underspecified inputs and manifest as overly cautious clarification requests or answers to queries that should first be disambiguated, rather than incorrect SQL generation. These results indicate that few-shot large language models can provide reliable database access in well-defined domains when combined with schema-aware prompting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces SANE (Schema-aware Natural-language Evaluation), a paradigm for domain-specific text-to-SQL evaluation using schema-grounded, automatically generated benchmarks based on real experimental structures in biological data from high-throughput microscopy. It evaluates a few-shot large language model with structured prompting and guardrails, asserting that accurate query generation is achievable without model training or fine-tuning, with most failures manifesting as clarification requests rather than incorrect SQL outputs.

Significance. If the central claim holds, the work would show that constrained schemas combined with prompting techniques can enable reliable natural language interfaces to complex biological datasets, reducing the need for SQL expertise. The SANE framework for scalable and reproducible benchmark generation represents a methodological contribution that could be applied to other domains.

major comments (2)
  1. [Abstract] Abstract: the claim that 'accurate query generation is achievable' and that 'most failures stem from ambiguous or underspecified inputs' is presented without any quantitative metrics, such as accuracy rates, number of test queries, error categorization counts, or baseline comparisons, making the central empirical assertion unevidenced.
  2. [Abstract] Abstract and SANE description: the benchmarks are automatically generated from the schema and tied to the experimental structure; this construction may systematically produce queries with explicit constraints, which could mean the observed behavior (low incorrect SQL, mostly clarifications) does not generalize to real user questions that often involve implicit joins, ambiguous column references, or omitted filters.
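The summary metrics the first comment asks for could be reported from per-query outcome labels along the following lines. The label names and the two accuracy definitions are hypothetical; no figures from the paper are used here.

```python
from collections import Counter


def summarize(outcomes):
    """Aggregate per-query outcome labels into counts and two accuracy rates.

    Expected labels (assumed for this sketch): "correct_sql", "incorrect_sql",
    "correct_clarification", "overcautious_clarification",
    "missed_clarification".
    """
    counts = Counter(outcomes)
    total = len(outcomes)
    answered = counts["correct_sql"] + counts["incorrect_sql"]
    correct = counts["correct_sql"] + counts["correct_clarification"]
    return {
        "n": total,
        "counts": dict(counts),
        # Share of all queries handled correctly, by SQL or by clarification.
        "overall_accuracy": correct / total if total else None,
        # Share of correct results among queries scored on their SQL.
        "sql_accuracy_when_answered":
            counts["correct_sql"] / answered if answered else None,
    }
```

Reporting the error categories alongside a single accuracy number makes it possible to check the claim that most failures are clarification-related rather than incorrect SQL.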

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and indicate where revisions will be made.

  1. Referee: [Abstract] Abstract: the claim that 'accurate query generation is achievable' and that 'most failures stem from ambiguous or underspecified inputs' is presented without any quantitative metrics, such as accuracy rates, number of test queries, error categorization counts, or baseline comparisons, making the central empirical assertion unevidenced.

    Authors: The abstract is a high-level summary. The full manuscript provides the requested quantitative details (accuracy rates, test query counts, error categorizations, and baseline comparisons) in the evaluation section. We will revise the abstract to include summary metrics and key counts. revision: yes

  2. Referee: [Abstract] Abstract and SANE description: the benchmarks are automatically generated from the schema and tied to the experimental structure; this construction may systematically produce queries with explicit constraints, which could mean the observed behavior (low incorrect SQL, mostly clarifications) does not generalize to real user questions that often involve implicit joins, ambiguous column references, or omitted filters.

    Authors: We acknowledge the concern regarding generalization. The SANE construction prioritizes schema-grounded, reproducible benchmarks tied to real experimental structures, as described in the methods. The manuscript already notes design implications in the discussion and limitations sections. We will expand the limitations paragraph to explicitly address differences from real-user queries containing implicit elements. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical performance report on generated benchmarks

The paper reports an empirical evaluation of few-shot LLM text-to-SQL performance under schema-aware prompting and guardrails. SANE benchmarks are defined as automatically generated from the schema and tied to experimental structure, but the central claim (accurate query generation is achievable without training, with failures mainly from underspecification) is presented as an observation from running the evaluation rather than a derived prediction or quantity that reduces to the generation process by construction. No equations, fitted parameters, self-citations as load-bearing premises, or uniqueness theorems appear. The result is an experimental finding on the tested distribution and does not collapse to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unstated premise that the chosen biological schemas are representative and constrained enough for the prompting strategy to generalize, plus the assumption that automatically generated benchmarks capture the distribution of real user queries. No free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption: Constrained schemas plus guardrails suffice to make few-shot LLM SQL generation reliable in the target domain
    Invoked when the abstract concludes that accurate query generation is achievable without training.
invented entities (1)
  • SANE (Schema-aware Natural-language Evaluation) · no independent evidence
    purpose: New benchmark paradigm for domain-specific text-to-SQL evaluation
    Introduced as the core contribution; no independent evidence outside this work is provided in the abstract.

pith-pipeline@v0.9.1-grok · 5695 in / 1458 out tokens · 37892 ms · 2026-06-28T06:32:17.867469+00:00 · methodology
