Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study
Pith reviewed 2026-05-21 12:44 UTC · model grok-4.3
The pith
CLUES decomposes semantic uncertainty in clinical Text-to-SQL into separate ambiguity and instability scores.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper casts Text-to-SQL as a two-stage interpretations-to-answers process and represents the mapping as a bipartite semantic graph. The Schur complement of the associated matrix isolates an instability score, which is paired with a directly computed ambiguity score. The paper reports that this decomposition improves failure prediction over Kernel Language Entropy on AmbigQA, SituatedQA, and a clinical Text-to-SQL benchmark, while supplying a diagnostic that no single uncertainty value can provide.
What carries the argument
The Schur complement of a bipartite semantic graph matrix, which extracts the instability component once interpretations are generated and answers are produced from them.
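A minimal numerical sketch of that operation, assuming the graph matrix is ordered into an interpretation block (I) and a response block (R) and using the regularized form S = W_RR - W_RI (W_II + εI)^{-1} W_IR that this page quotes from the paper. The function name, the block sizes, and the toy weights are illustrative assumptions, not the authors' construction.

```python
import numpy as np

def schur_response_block(W, n_interp, eps=1e-6):
    """Schur complement of the interpretation block of a block matrix.

    W is assumed to be ordered as [interpretations, responses]:
        W = [[W_II, W_IR],
             [W_RI, W_RR]]
    Returns S = W_RR - W_RI (W_II + eps*I)^{-1} W_IR.
    """
    W = np.asarray(W, dtype=float)
    W_II = W[:n_interp, :n_interp]
    W_IR = W[:n_interp, n_interp:]
    W_RI = W[n_interp:, :n_interp]
    W_RR = W[n_interp:, n_interp:]
    regularized = W_II + eps * np.eye(n_interp)
    # Solve a linear system rather than forming the inverse explicitly.
    return W_RR - W_RI @ np.linalg.solve(regularized, W_IR)

# Toy matrix: 2 interpretations, 3 responses (weights are made up).
W = np.array([
    [2.0, 0.0, 1.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 0.0, 1.0],
    [1.0, 0.0, 2.0, 0.5, 0.0],
    [1.0, 0.0, 0.5, 2.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 2.0],
])
S = schur_response_block(W, n_interp=2)
```

In this toy case the first two responses are linked only through their shared interpretation, so the off-diagonal weight between them vanishes in S. How the paper reduces S to a scalar instability score is not shown on this page, so that step is left out.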
If this is right
- The high-ambiguity high-instability regime contains 51 percent of errors while covering only 25 percent of queries, enabling efficient triage.
- CLUES supplies a diagnostic decomposition of uncertainty that single-score methods cannot provide.
- Different uncertainty regimes map to distinct interventions: query refinement for ambiguity and model improvement for instability.
- The method remains competitive with existing approaches in deployment settings while adding the diagnostic split.
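The triage figure and the regime-to-intervention mapping above can be made concrete with a short sketch. The threshold values and function names are hypothetical; the page does not say how the paper sets regime boundaries. Only the 51% and 25% figures and the two intervention names come from the source.

```python
def error_lift(error_share, query_share):
    """How over-represented errors are in a regime relative to its size."""
    return error_share / query_share

def suggest_interventions(ambiguity, instability, amb_thr=0.5, inst_thr=0.5):
    """Map a query's two scores to the interventions named in the review.

    amb_thr and inst_thr are placeholder cut-offs (an assumption).
    """
    actions = []
    if ambiguity > amb_thr:
        actions.append("query refinement")
    if instability > inst_thr:
        actions.append("model improvement")
    return actions

# 51% of errors concentrated in 25% of queries.
lift = error_lift(0.51, 0.25)
```

Here `lift` comes out at about 2.04, meaning a query in the high-ambiguity, high-instability regime is roughly twice as likely to be an error as a query drawn at random.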
Where Pith is reading between the lines
- The same two-stage graph construction could be tested on non-SQL generation tasks to see whether the ambiguity-instability split remains useful outside clinical Text-to-SQL.
- In live systems the decomposition could automatically route high-ambiguity queries to a clarification step before any SQL is executed.
- If the Schur-complement instability score correlates with measurable model variation under prompt perturbation, it would strengthen the case for using it as a practical diagnostic.
Load-bearing premise
Semantic uncertainty can be cleanly decomposed into input ambiguity and model instability through a two-stage interpretations-to-answers process and the Schur complement of a bipartite semantic graph matrix without substantial information loss or confounding.
What would settle it
A direct head-to-head test on the same clinical Text-to-SQL benchmark showing that a single non-decomposed uncertainty score predicts errors at least as well as the separate ambiguity-plus-instability scores would falsify the added value of the decomposition.
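One way such a head-to-head comparison could be scored is with AUROC for error prediction, sketched below. The rule for combining the two scores (a plain sum) is an assumption made for illustration, and the numbers are made up; they carry no evidential weight about the paper's results.

```python
import numpy as np

def auroc(scores, is_error):
    """Probability that a random error outscores a random non-error.

    Ties count as half a win (the Mann-Whitney formulation of AUROC).
    """
    scores = np.asarray(scores, dtype=float)
    is_error = np.asarray(is_error, dtype=bool)
    pos = scores[is_error]
    neg = scores[~is_error]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one error and one non-error")
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)

# Synthetic per-query values, for illustration only.
is_error = np.array([1, 1, 0, 0, 1, 0])
single_score = np.array([0.9, 0.4, 0.5, 0.2, 0.6, 0.3])
ambiguity = np.array([0.8, 0.1, 0.2, 0.1, 0.5, 0.2])
instability = np.array([0.3, 0.7, 0.2, 0.1, 0.4, 0.1])

auroc_single = auroc(single_score, is_error)
auroc_decomposed = auroc(ambiguity + instability, is_error)  # assumed combination
```

If the single score matched or exceeded the decomposed score on the real benchmark, the decomposition would retain only its diagnostic role, not a predictive advantage.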
Original abstract
Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity that should trigger clarification, and (ii) model instability that should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations --> answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition unavailable from a single score. The resulting uncertainty regimes map to targeted interventions - query refinement for ambiguity, model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering 25% of queries, enabling efficient triage.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CLUES, a framework that models LLM Text-to-SQL as a two-stage interpretations-to-answers process and decomposes semantic uncertainty into an ambiguity score (from input interpretations) and an instability score (via the Schur complement of a bipartite semantic graph matrix). It reports improved failure prediction over Kernel Language Entropy on AmbigQA/SituatedQA and a clinical Text-to-SQL benchmark, along with a high-ambiguity/high-instability regime that contains 51% of errors while covering 25% of queries for efficient triage.
Significance. If the decomposition is shown to be robust, the work provides a diagnostically useful advance over single-score uncertainty measures by enabling targeted interventions (clarification for ambiguity, review for instability). The concrete 51%/25% triage statistic offers practical value for high-stakes clinical deployment, and the two-stage modeling plus graph-based derivation represent a structured approach to uncertainty that could generalize beyond Text-to-SQL.
major comments (1)
- §3.2 (Instability Score Computation): The claim that the Schur complement cleanly isolates model instability from input ambiguity requires that the bipartite semantic graph matrix construction and complement operation remove all cross-terms. Because the graph is built from the same LLM outputs used to generate interpretations on the evaluation queries, shared model artifacts in clinical SQL semantics may leave residual dependence; this risks the two scores sharing variance rather than being independent. This is load-bearing for the central disentanglement claim and the diagnostic regimes. An explicit orthogonality check (e.g., correlation between the two scores across queries or a controlled experiment with fixed ambiguity) is needed.
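The correlation check requested in this comment could look roughly like the sketch below, which computes Pearson and Spearman correlations between the two per-query scores. The helper names are hypothetical, and the rank function breaks ties by order of appearance, which is adequate only when scores are effectively continuous.

```python
import numpy as np

def pearson(x, y):
    """Linear correlation between two score vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def ordinal_ranks(x):
    """Ranks 0..n-1; ties are broken by position (a simplification)."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(len(x))
    return ranks

def spearman(x, y):
    """Rank correlation: Pearson correlation of the ordinal ranks."""
    return pearson(ordinal_ranks(x), ordinal_ranks(y))
```

A call such as `spearman(ambiguity_scores, instability_scores)` over all evaluation queries would give the headline number. A correlation near zero would be consistent with the disentanglement claim but would not by itself establish independence, which is why the comment also asks for a controlled experiment with fixed ambiguity.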
minor comments (3)
- Abstract and §5: The reported improvements and 51%/25% coverage statistic are given without error bars, confidence intervals, or statistical significance tests; these should be added to allow assessment of robustness.
- Evaluation Setup: Clarify how the 'known interpretations' for the clinical Text-to-SQL benchmark were obtained or annotated, as this directly affects the independence of the ambiguity score.
- Notation: The bipartite semantic graph matrix and its Schur complement would benefit from an explicit small example or pseudocode to illustrate the decomposition for readers unfamiliar with the linear-algebra step.
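For the first minor comment, one common way to attach an interval to a statistic like "share of errors inside a regime" is a percentile bootstrap over queries. The sketch below is an assumption about how that could be done, not the authors' procedure; the function name, defaults, and synthetic data are illustrative.

```python
import numpy as np

def bootstrap_error_share(in_regime, is_error, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the fraction of all errors inside a regime."""
    in_regime = np.asarray(in_regime, dtype=bool)
    is_error = np.asarray(is_error, dtype=bool)
    rng = np.random.default_rng(seed)
    n = len(is_error)
    shares = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample queries with replacement
        errors = is_error[idx]
        if errors.sum() == 0:
            continue  # share is undefined when a resample has no errors
        shares.append((errors & in_regime[idx]).sum() / errors.sum())
    low, high = np.quantile(shares, [alpha / 2, 1 - alpha / 2])
    point = (is_error & in_regime).sum() / is_error.sum()
    return float(point), float(low), float(high)

# Synthetic data: 100 queries, 25 in the regime, 20 errors, half of them in the regime.
in_regime = np.array([True] * 25 + [False] * 75)
is_error = np.array([True] * 10 + [False] * 15 + [True] * 10 + [False] * 65)
point, low, high = bootstrap_error_share(in_regime, is_error)
```

With only 20 errors the interval is wide, which illustrates why the comment asks for uncertainty estimates alongside the reported 51%/25% figure.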
Simulated Author's Rebuttal
We thank the referee for their valuable comments, which have helped us improve the clarity of our work on disentangling ambiguity and instability in LLM outputs for clinical Text-to-SQL. Below we provide a point-by-point response to the major comment.
Referee: §3.2 (Instability Score Computation): The claim that the Schur complement cleanly isolates model instability from input ambiguity requires that the bipartite semantic graph matrix construction and complement operation remove all cross-terms. Because the graph is built from the same LLM outputs used to generate interpretations on the evaluation queries, shared model artifacts in clinical SQL semantics may leave residual dependence; this risks the two scores sharing variance rather than being independent. This is load-bearing for the central disentanglement claim and the diagnostic regimes. An explicit orthogonality check (e.g., correlation between the two scores across queries or a controlled experiment with fixed ambiguity) is needed.
Authors: We thank the referee for this insightful observation on the potential for residual dependence between the scores. The bipartite semantic graph is built from LLM-generated interpretations and answers, and the Schur complement of the resulting matrix computes the instability component by eliminating the variance attributable to interpretation diversity (ambiguity). This follows from the block-matrix properties that isolate conditional dependencies. That said, we recognize the value of empirical verification to rule out shared variance from model artifacts. We will therefore revise the manuscript to include an orthogonality check, reporting the correlation between the ambiguity and instability scores across the evaluation queries in both the general and clinical benchmarks. We will also describe a controlled analysis that holds interpretations fixed to isolate variation in instability. These revisions will be incorporated in the updated version to bolster the disentanglement claim.
Revision: yes
Circularity Check
No circularity: Schur complement decomposition is an independent linear-algebra derivation on model outputs
The paper models Text-to-SQL as a two-stage interpretations-to-answers process and computes the instability score via the Schur complement of a bipartite semantic graph matrix whose nodes and edges are populated from the LLM's generated interpretations and answers. This is a direct algebraic operation (standard block-matrix reduction) applied to an externally constructed adjacency matrix; it does not redefine the target quantity in terms of itself, fit parameters to the evaluation set and then relabel the fit as a prediction, or rely on a self-citation chain for its justification. The ambiguity score is likewise extracted from interpretation diversity without circular dependence on the instability term. Because the derivation remains a deterministic function of the observed outputs rather than a tautological renaming or post-hoc fit, the central claim of disentanglement is self-contained and does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLM Text-to-SQL generation can be accurately modeled as a two-stage process of interpretations followed by answers.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  unclear: relation between the paper passage and the cited Recognition theorem.
  Passage: The instability score is computed via the Schur complement of a bipartite semantic graph matrix... S = W_RR - W_RI (W_II + εI)^{-1} W_IR
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  unclear: relation between the paper passage and the cited Recognition theorem.
  Passage: We define the ambiguity score as the entropy over the set of interpretations... H_I by applying the KLE framework to W_II
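The quoted passage describes the ambiguity score as an entropy obtained by applying the KLE framework to W_II. Kernel Language Entropy is built on the von Neumann entropy of a unit-trace positive semidefinite kernel, so a hedged sketch of that quantity is given below. Treating W_II directly as the kernel is an assumption; the page does not show whether the paper first transforms it (for example into a heat kernel).

```python
import numpy as np

def von_neumann_entropy(K, tol=1e-12):
    """Von Neumann entropy -Tr[K log K] of a PSD matrix after unit-trace scaling."""
    K = np.asarray(K, dtype=float)
    K = (K + K.T) / 2.0                  # symmetrize against numerical noise
    eigenvalues = np.linalg.eigvalsh(K)
    eigenvalues = np.clip(eigenvalues, 0.0, None)  # discard tiny negative values
    total = eigenvalues.sum()
    if total <= 0:
        raise ValueError("kernel must have positive trace")
    p = eigenvalues / total              # eigenvalues of the unit-trace kernel
    p = p[p > tol]                       # 0 * log(0) is taken as 0
    return float(-(p * np.log(p)).sum())

# Two fully distinct interpretations versus two identical ones (toy kernels).
h_distinct = von_neumann_entropy(np.eye(2))
h_identical = von_neumann_entropy(np.ones((2, 2)))
```

The two toy kernels bracket the behavior: fully distinct interpretations give the maximum entropy for two items (ln 2), while interpretations the kernel treats as identical give zero.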
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Reliable Automated Triage in Spanish Clinical Notes: A Hybrid Framework for Risk-Aware HIV Suspicion Identification
  A dual-verification selective classifier using conformal prediction and geometric distance vetoes achieves reliable HIV suspicion triage in Spanish clinical notes by isolating a high-trust subset of predictions.