arxiv: 2602.12015 · v2 · pith:T4CGHNTXnew · submitted 2026-02-12 · 💻 cs.CL

Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study

Pith reviewed 2026-05-21 12:44 UTC · model grok-4.3

classification 💻 cs.CL
keywords ambiguity · instability · semantic uncertainty · Text-to-SQL · clinical NLP · failure prediction · large language models · uncertainty decomposition

The pith

CLUES decomposes semantic uncertainty in clinical Text-to-SQL into separate ambiguity and instability scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to separate two sources of output variation when large language models generate SQL from clinical text: ambiguity already present in the user's query, which calls for clarification, and instability inside the model itself, which calls for review or improvement. It models the generation process as a two-stage sequence from possible interpretations to final answers and extracts an ambiguity score directly from interpretation diversity while deriving an instability score from the Schur complement of the bipartite graph that links interpretations to answers. This split yields better prediction of which outputs will be wrong than single-score baselines such as Kernel Language Entropy, and it surfaces a high-ambiguity high-instability regime that contains 51 percent of errors inside only 25 percent of queries. A reader would care because the decomposition turns an undifferentiated uncertainty number into concrete next steps for safe deployment in clinical settings.

Core claim

By casting Text-to-SQL as a two-stage interpretations-to-answers process and representing the mapping as a bipartite semantic graph, the Schur complement of the associated matrix isolates an instability score; combined with a direct ambiguity score, the resulting decomposition improves failure prediction over Kernel Language Entropy on AmbigQA, SituatedQA, and a clinical Text-to-SQL benchmark while supplying a diagnostic unavailable from any single uncertainty value.
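
To make the two-stage picture concrete, here is a minimal sketch of how sampled (interpretation, answer) pairs could be arranged into a bipartite weight matrix. It is an illustration under stated assumptions, not the paper's construction: the function name `build_bipartite_weights`, the labels in the usage note, and the use of plain co-occurrence counts as edge weights are all hypothetical. The paper calls the graph "semantic", so its weights may well come from a similarity measure instead.

```python
def build_bipartite_weights(samples):
    """Arrange sampled (interpretation, answer) pairs into a weight matrix.

    samples: list of (interpretation_label, answer_label) tuples, one per
             two-stage generation (hypothetical input format).
    Returns (interpretations, answers, W) where W[i][j] counts how often
    interpretation i led to answer j. Counts are an assumed stand-in for
    whatever edge weights the paper actually uses.
    """
    interpretations = sorted({interp for interp, _ in samples})
    answers = sorted({ans for _, ans in samples})
    row = {label: r for r, label in enumerate(interpretations)}
    col = {label: c for c, label in enumerate(answers)}

    W = [[0] * len(answers) for _ in interpretations]
    for interp, ans in samples:
        W[row[interp]][col[ans]] += 1
    return interpretations, answers, W
```

For example, if interpretation "I1" always yields answer "A1" while "I2" splits between "A2" and "A3", the rows of `W` show one concentrated row and one spread-out row, which is the kind of structure the decomposition is meant to separate.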

What carries the argument

The Schur complement of a bipartite semantic graph matrix, which extracts the instability component once interpretations are generated and answers are produced from them.
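
For readers unfamiliar with the linear-algebra step: for a block matrix M = [[A, B], [C, D]] with invertible A, the Schur complement of A is D - C A^{-1} B. The sketch below applies that definition to one plausible matrix, the Laplacian of a weighted bipartite graph, L = [[D_I, -W], [-W^T, D_A]], where D_I and D_A are the diagonal degree matrices of interpretation and answer nodes. Choosing the Laplacian is an assumption made here for illustration; this page does not state which matrix the paper uses or how the result is reduced to a scalar score. The function name is likewise hypothetical.

```python
def schur_complement_bipartite(W):
    """Eliminate interpretation nodes from a bipartite graph Laplacian.

    W[i][j] >= 0 is the weight linking interpretation i to answer j.
    With L = [[D_I, -W], [-W^T, D_A]], the Schur complement of the D_I
    block is S = D_A - W^T D_I^{-1} W. Because D_I is diagonal, its
    inverse is elementwise and no general matrix inversion is needed.
    """
    n_interp, n_ans = len(W), len(W[0])
    deg_interp = [sum(row) for row in W]
    deg_ans = [sum(W[i][j] for i in range(n_interp)) for j in range(n_ans)]
    if any(d == 0 for d in deg_interp):
        raise ValueError("every interpretation needs at least one answer edge")

    S = [[0.0] * n_ans for _ in range(n_ans)]
    for j in range(n_ans):
        for k in range(n_ans):
            coupling = sum(
                W[i][j] * W[i][k] / deg_interp[i] for i in range(n_interp)
            )
            S[j][k] = (deg_ans[j] if j == k else 0.0) - coupling
    return S
```

Under this assumed construction, S is the zero matrix exactly when every interpretation maps to a single answer, and off-diagonal entries appear when one interpretation spreads across several answers. Whether that behavior matches the paper's instability score is not something this page confirms.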

If this is right

  • The high-ambiguity/high-instability regime contains 51 percent of errors while covering only 25 percent of queries, enabling efficient triage.
  • CLUES supplies a diagnostic decomposition of uncertainty that single-score methods cannot provide.
  • Different uncertainty regimes map to distinct interventions: query refinement for ambiguity and model improvement for instability.
  • The method remains competitive with existing approaches in deployment settings while adding the diagnostic split.
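
The triage statistic in the first bullet compares two fractions per regime: the share of queries that fall in it (coverage) and the share of all errors it contains. A minimal sketch of that bookkeeping follows. The thresholds, the regime labels, and the record format are assumptions for illustration; the paper's own cut-offs and regime numbering are not given on this page.

```python
from collections import Counter


def assign_regime(ambiguity, instability, amb_thr, inst_thr):
    """Label a query by thresholding both scores (thresholds are assumed)."""
    amb = "high-ambiguity" if ambiguity >= amb_thr else "low-ambiguity"
    inst = "high-instability" if instability >= inst_thr else "low-instability"
    return f"{amb}/{inst}"


def regime_summary(records, amb_thr, inst_thr):
    """Compute coverage and error share for each regime.

    records: iterable of (ambiguity, instability, is_error) tuples.
    Returns {regime: {"coverage": ..., "error_share": ...}}.
    """
    n_queries, n_errors = Counter(), Counter()
    for ambiguity, instability, is_error in records:
        regime = assign_regime(ambiguity, instability, amb_thr, inst_thr)
        n_queries[regime] += 1
        n_errors[regime] += int(is_error)

    total_queries = sum(n_queries.values())
    total_errors = sum(n_errors.values())
    return {
        regime: {
            "coverage": n_queries[regime] / total_queries,
            "error_share": (
                n_errors[regime] / total_errors if total_errors else 0.0
            ),
        }
        for regime in n_queries
    }
```

A regime is useful for triage when its error share is well above its coverage, as in the reported 51 percent versus 25 percent.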

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-stage graph construction could be tested on non-SQL generation tasks to see whether the ambiguity-instability split remains useful outside clinical Text-to-SQL.
  • In live systems the decomposition could automatically route high-ambiguity queries to a clarification step before any SQL is executed.
  • If the Schur-complement instability score correlates with measurable model variation under prompt perturbation, it would strengthen the case for using it as a practical diagnostic.
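
The routing idea in the second bullet can be sketched as a small dispatch function. This is an editorial illustration only: the action names, the default thresholds, and the choice to apply both interventions when both scores are high are assumptions, not behavior described by the paper.

```python
def route_query(ambiguity, instability, amb_thr=0.5, inst_thr=0.5):
    """Map the two scores to follow-up actions before any SQL is executed.

    Thresholds and action names are hypothetical placeholders.
    """
    actions = []
    if ambiguity >= amb_thr:
        # Input ambiguity: ask the user which reading they meant.
        actions.append("ask_user_to_clarify")
    if instability >= inst_thr:
        # Model instability: a person should inspect the generated SQL.
        actions.append("send_to_human_review")
    if not actions:
        # Both scores are low: proceed without intervention.
        actions.append("execute_sql")
    return actions
```

Returning a list rather than a single action keeps the high-ambiguity/high-instability case explicit instead of silently preferring one intervention.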

Load-bearing premise

Semantic uncertainty can be cleanly decomposed into input ambiguity and model instability through a two-stage interpretations-to-answers process and the Schur complement of a bipartite semantic graph matrix without substantial information loss or confounding.

What would settle it

A direct head-to-head test on the same clinical Text-to-SQL benchmark showing that a single non-decomposed uncertainty score predicts errors at least as well as the separate ambiguity-plus-instability scores would falsify the added value of the decomposition.
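
Such a head-to-head test needs a common yardstick for "predicts errors". One common choice is AUROC: the probability that a randomly chosen wrong output receives a higher uncertainty score than a randomly chosen correct one. The sketch below computes it by direct pairwise comparison. Using AUROC here is an assumption, since this page does not state which failure-prediction metric the paper reports, and the function name is illustrative.

```python
def auroc(scores, is_error):
    """Area under the ROC curve for error prediction.

    scores:   uncertainty score per query (higher should mean riskier).
    is_error: matching booleans, True when the output was wrong.
    Ties between an error and a non-error count as half a win.
    """
    error_scores = [s for s, e in zip(scores, is_error) if e]
    correct_scores = [s for s, e in zip(scores, is_error) if not e]
    if not error_scores or not correct_scores:
        raise ValueError("need at least one error and one correct output")

    wins = 0.0
    for err in error_scores:
        for ok in correct_scores:
            if err > ok:
                wins += 1.0
            elif err == ok:
                wins += 0.5
    return wins / (len(error_scores) * len(correct_scores))
```

Running `auroc` once on a single non-decomposed score and once on a score derived from the ambiguity and instability pair, over the same queries and labels, gives the comparison the paragraph above describes. The pairwise loop is quadratic, which is fine for benchmark-sized data.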

Figures

Figures reproduced from arXiv: 2602.12015 by Angelo Ziletti, Leonardo D'Ambrosi.

Figure 1: Examples of uncertainty regimes in CLUES. (Left) Regime II, Ambiguity: Interpretations (brown) are …

Original abstract

Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity that should trigger clarification, and (ii) model instability that should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations --> answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition unavailable from a single score. The resulting uncertainty regimes map to targeted interventions - query refinement for ambiguity, model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering 25% of queries, enabling efficient triage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript introduces CLUES, a framework that models LLM Text-to-SQL as a two-stage interpretations-to-answers process and decomposes semantic uncertainty into an ambiguity score (from input interpretations) and an instability score (via the Schur complement of a bipartite semantic graph matrix). It reports improved failure prediction over Kernel Language Entropy on AmbigQA/SituatedQA and a clinical Text-to-SQL benchmark, along with a high-ambiguity/high-instability regime that contains 51% of errors while covering 25% of queries for efficient triage.

Significance. If the decomposition is shown to be robust, the work provides a diagnostically useful advance over single-score uncertainty measures by enabling targeted interventions (clarification for ambiguity, review for instability). The concrete 51%/25% triage statistic offers practical value for high-stakes clinical deployment, and the two-stage modeling plus graph-based derivation represent a structured approach to uncertainty that could generalize beyond Text-to-SQL.

major comments (1)
  1. [§3.2 (Instability Score Computation)] The claim that the Schur complement cleanly isolates model instability from input ambiguity requires that the bipartite semantic graph matrix construction and complement operation remove all cross-terms. Because the graph is built from the same LLM outputs used to generate interpretations on the evaluation queries, shared model artifacts in clinical SQL semantics may leave residual dependence; this risks the two scores sharing variance rather than being independent. This is load-bearing for the central disentanglement claim and the diagnostic regimes. An explicit orthogonality check (e.g., correlation between the two scores across queries or a controlled experiment with fixed ambiguity) is needed.
minor comments (3)
  1. [Abstract and §5] The reported improvements and the 51%/25% coverage statistic are given without error bars, confidence intervals, or statistical significance tests; these should be added to allow assessment of robustness.
  2. [Evaluation Setup] Clarify how the 'known interpretations' for the clinical Text-to-SQL benchmark were obtained or annotated, as this directly affects the independence of the ambiguity score.
  3. [Notation] The bipartite semantic graph matrix and its Schur complement would benefit from an explicit small example or pseudocode to illustrate the decomposition for readers unfamiliar with the linear-algebra step.
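
One way to supply the interval requested in minor comment 1 is a percentile bootstrap over queries. The sketch below resamples queries with replacement and recomputes the share of errors that fall inside a chosen regime. It is an assumed procedure rather than anything the paper reports, and the function name, input format, and defaults are illustrative.

```python
import random


def bootstrap_error_share(in_regime, is_error, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap interval for the share of errors in a regime.

    in_regime: booleans, True when the query falls in the regime of interest.
    is_error:  booleans, True when the model output was wrong.
    Returns (lower, upper) bounds at confidence level 1 - alpha.
    """
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    pairs = list(zip(in_regime, is_error))
    n = len(pairs)

    shares = []
    for _ in range(n_boot):
        sample = [pairs[rng.randrange(n)] for _ in range(n)]
        regime_flags_of_errors = [r for r, e in sample if e]
        if regime_flags_of_errors:  # skip resamples that contain no errors
            shares.append(
                sum(regime_flags_of_errors) / len(regime_flags_of_errors)
            )
    if not shares:
        raise ValueError("no resample contained an error")

    shares.sort()
    lower = shares[int((alpha / 2) * len(shares))]
    upper = shares[min(len(shares) - 1, int((1 - alpha / 2) * len(shares)))]
    return lower, upper
```

The same resampling loop could be reused for coverage or for a difference in failure-prediction scores between two methods by swapping the statistic computed inside it.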

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their valuable comments, which have helped us improve the clarity of our work on disentangling ambiguity and instability in LLM outputs for clinical Text-to-SQL. Below we provide a point-by-point response to the major comment.

Point-by-point responses
  1. Referee: §3.2 (Instability Score Computation): The claim that the Schur complement cleanly isolates model instability from input ambiguity requires that the bipartite semantic graph matrix construction and complement operation remove all cross-terms. Because the graph is built from the same LLM outputs used to generate interpretations on the evaluation queries, shared model artifacts in clinical SQL semantics may leave residual dependence; this risks the two scores sharing variance rather than being independent. This is load-bearing for the central disentanglement claim and the diagnostic regimes. An explicit orthogonality check (e.g., correlation between the two scores across queries or a controlled experiment with fixed ambiguity) is needed.

    Authors: We thank the referee for this insightful observation on the potential for residual dependence between the scores. The bipartite semantic graph is built from LLM-generated interpretations and answers, and the Schur complement of its matrix computes the instability component by eliminating the variance attributable to interpretation diversity (ambiguity). This is grounded in the block-matrix properties that isolate conditional dependencies. That said, we recognize the value of an empirical verification to rule out any shared variance from model artifacts. Accordingly, we will revise the manuscript to include an orthogonality check: we will report the correlation between the ambiguity and instability scores across the evaluation queries in both the general and clinical benchmarks. Additionally, we will describe a controlled analysis that holds interpretations fixed to isolate variation in instability. These revisions will be incorporated in the updated version to bolster the disentanglement claim.

    revision: yes
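
The promised orthogonality check amounts to correlating the two per-query scores. A minimal sketch with Pearson and Spearman coefficients follows. The function names are illustrative, and which coefficient the authors will report is not stated. A coefficient near zero is consistent with the two scores carrying separate information, but it does not by itself establish independence.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant list")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `spearman(ambiguity_scores, instability_scores)` over all evaluation queries gives the rank-based version of the check, which is less sensitive to the scale of either score than the Pearson coefficient.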

Circularity Check

0 steps flagged

No circularity: Schur complement decomposition is an independent linear-algebra derivation on model outputs

full rationale

The paper models Text-to-SQL as a two-stage interpretations-to-answers process and computes the instability score via the Schur complement of a bipartite semantic graph matrix whose nodes and edges are populated from the LLM's generated interpretations and answers. This is a direct algebraic operation (standard block-matrix reduction) applied to an externally constructed adjacency matrix; it does not redefine the target quantity in terms of itself, fit parameters to the evaluation set and then relabel the fit as a prediction, or rely on a self-citation chain for its justification. The ambiguity score is likewise extracted from interpretation diversity without circular dependence on the instability term. Because the derivation remains a deterministic function of the observed outputs rather than a tautological renaming or post-hoc fit, the central claim of disentanglement is self-contained and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on modeling assumptions about LLM generation as a two-stage process and the validity of the bipartite graph representation for semantic uncertainty; no free parameters or invented entities are explicitly described in the abstract.

axioms (1)
  • domain assumption: LLM Text-to-SQL generation can be accurately modeled as a two-stage process of interpretations followed by answers
    This decomposition is required to separate ambiguity from instability as stated in the abstract.

pith-pipeline@v0.9.0 · 5707 in / 1308 out tokens · 44687 ms · 2026-05-21T12:44:43.564487+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reliable Automated Triage in Spanish Clinical Notes: A Hybrid Framework for Risk-Aware HIV Suspicion Identification

    cs.CL 2026-05 unverdicted novelty 5.0

    A dual-verification selective classifier using conformal prediction and geometric distance vetoes achieves reliable HIV suspicion triage in Spanish clinical notes by isolating a high-trust subset of predictions.
