Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study

Angelo Ziletti; Leonardo D'Ambrosi

arxiv: 2602.12015 · v2 · pith:T4CGHNTXnew · submitted 2026-02-12 · 💻 cs.CL

Pith reviewed 2026-05-21 12:44 UTC · model grok-4.3

keywords: ambiguity · instability · semantic uncertainty · Text-to-SQL · clinical NLP · failure prediction · large language models · uncertainty decomposition

The pith

CLUES decomposes semantic uncertainty in clinical Text-to-SQL into separate ambiguity and instability scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to separate two sources of output variation when large language models generate SQL from clinical text: ambiguity already present in the user's query, which calls for clarification, and instability inside the model itself, which calls for review or improvement. It models the generation process as a two-stage sequence from possible interpretations to final answers and extracts an ambiguity score directly from interpretation diversity while deriving an instability score from the Schur complement of the bipartite graph that links interpretations to answers. This split yields better prediction of which outputs will be wrong than single-score baselines such as Kernel Language Entropy, and it surfaces a high-ambiguity high-instability regime that contains 51 percent of errors inside only 25 percent of queries. A reader would care because the decomposition turns an undifferentiated uncertainty number into concrete next steps for safe deployment in clinical settings.

Core claim

By casting Text-to-SQL as a two-stage interpretations-to-answers process and representing the mapping as a bipartite semantic graph, the Schur complement of the associated matrix isolates an instability score; combined with a direct ambiguity score, the resulting decomposition improves failure prediction over Kernel Language Entropy on AmbigQA, SituatedQA, and a clinical Text-to-SQL benchmark while supplying a diagnostic unavailable from any single uncertainty value.

What carries the argument

The Schur complement of a bipartite semantic graph matrix, which extracts the instability component once interpretations are generated and answers are produced from them.
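A minimal sketch of that block-matrix step, assuming the form S = W_RR - W_RI (W_II + εI)^{-1} W_IR quoted elsewhere on this page. The node ordering (interpretations first, answers second), the function name `schur_complement`, and the default `eps` are illustrative assumptions, and how the paper reduces S to a scalar instability score is not shown here.

```python
import numpy as np

def schur_complement(W, n_interp, eps=1e-6):
    """Return S = W_RR - W_RI (W_II + eps*I)^{-1} W_IR.

    W is a square similarity matrix whose first n_interp rows/columns are
    interpretation nodes (I) and whose remaining ones are answer nodes (R).
    This ordering is an assumption made for the sketch.
    """
    W = np.asarray(W, dtype=float)
    W_II = W[:n_interp, :n_interp]
    W_IR = W[:n_interp, n_interp:]
    W_RI = W[n_interp:, :n_interp]
    W_RR = W[n_interp:, n_interp:]
    # Regularise W_II and solve a linear system rather than inverting it.
    regularised = W_II + eps * np.eye(n_interp)
    return W_RR - W_RI @ np.linalg.solve(regularised, W_IR)

# Toy graph: answer 1 is fully tied to interpretation 1 (weight 1.0),
# answer 2 is only partly tied to interpretation 2 (weight 0.5).
W_toy = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.5],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 0.5, 0.0, 1.0],
])
S_toy = schur_complement(W_toy, n_interp=2)
```

In the toy graph the residual for the first answer is close to 0 and the residual for the second is close to 0.75, which is the intuition behind reading S as the answer variation that interpretation diversity does not account for.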

If this is right

The high-ambiguity high-instability regime contains 51 percent of errors while covering only 25 percent of queries, enabling efficient triage.
CLUES supplies a diagnostic decomposition of uncertainty that single-score methods cannot provide.
Different uncertainty regimes map to distinct interventions: query refinement for ambiguity and model improvement for instability.
The method remains competitive with existing approaches in deployment settings while adding the diagnostic split.
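The regime bookkeeping behind a statistic like "51 percent of errors in 25 percent of queries" can be sketched as follows. The thresholds, the regime labels, and the function names are hypothetical; the page does not say how the paper chooses its cut-offs, and the records in the usage note are toy values, not results.

```python
from collections import Counter

def regime(ambiguity, instability, amb_thr, inst_thr):
    """Assign a query to one of four regimes using two (assumed) thresholds."""
    a = "high" if ambiguity >= amb_thr else "low"
    i = "high" if instability >= inst_thr else "low"
    return f"{a}-ambiguity/{i}-instability"

def triage_summary(records, amb_thr, inst_thr):
    """Map each regime to (share of all queries, share of all errors).

    records is an iterable of (ambiguity, instability, is_error) tuples.
    """
    queries, errors = Counter(), Counter()
    for ambiguity, instability, is_error in records:
        name = regime(ambiguity, instability, amb_thr, inst_thr)
        queries[name] += 1
        if is_error:
            errors[name] += 1
    n_queries = sum(queries.values())
    n_errors = sum(errors.values())
    return {
        name: (queries[name] / n_queries,
               errors[name] / n_errors if n_errors else 0.0)
        for name in queries
    }
```

For example, `triage_summary([(0.9, 0.8, True), (0.9, 0.1, False), (0.1, 0.9, True), (0.1, 0.1, False)], 0.5, 0.5)` reports the high-ambiguity/high-instability regime as a quarter of the queries and half of the errors in this toy input.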

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-stage graph construction could be tested on non-SQL generation tasks to see whether the ambiguity-instability split remains useful outside clinical Text-to-SQL.
In live systems the decomposition could automatically route high-ambiguity queries to a clarification step before any SQL is executed.
If the Schur-complement instability score correlates with measurable model variation under prompt perturbation, it would strengthen the case for using it as a practical diagnostic.

Load-bearing premise

Semantic uncertainty can be cleanly decomposed into input ambiguity and model instability through a two-stage interpretations-to-answers process and the Schur complement of a bipartite semantic graph matrix without substantial information loss or confounding.

What would settle it

A direct head-to-head test on the same clinical Text-to-SQL benchmark showing that a single non-decomposed uncertainty score predicts errors at least as well as the separate ambiguity-plus-instability scores would falsify the added value of the decomposition.
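One way to run such a head-to-head comparison is to score each predictor by AUROC against the same error labels. The sketch below is an assumption about the evaluation, not the paper's protocol: it uses the pairwise (Mann-Whitney) definition of AUROC, and the function name is illustrative.

```python
import numpy as np

def auroc(scores, is_error):
    """AUROC of an uncertainty score as a failure predictor.

    Equals the probability that a randomly chosen failed query receives a
    higher score than a randomly chosen correct one, counting ties as half.
    """
    scores = np.asarray(scores, dtype=float)
    is_error = np.asarray(is_error, dtype=bool)
    failed, correct = scores[is_error], scores[~is_error]
    if len(failed) == 0 or len(correct) == 0:
        raise ValueError("need at least one failed and one correct query")
    # Compare every failed query against every correct query.
    diff = failed[:, None] - correct[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)
```

Calling `auroc(single_score, is_error)` and `auroc(decomposed_score, is_error)` on the same queries gives the two numbers to compare, where `decomposed_score` stands for whatever scalar predictor is built from the ambiguity and instability scores (a hypothetical name, since the combination rule is not given on this page).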

Original abstract

Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity that should trigger clarification, and (ii) model instability that should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations --> answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition unavailable from a single score. The resulting uncertainty regimes map to targeted interventions - query refinement for ambiguity, model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering 25% of queries, enabling efficient triage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CLUES gives a workable split of uncertainty into ambiguity and instability for clinical Text-to-SQL via Schur complement on a bipartite graph, with decent gains on failure prediction, though the separation may still carry some shared variance.

The paper's core move is to treat Text-to-SQL as a two-stage process and use the Schur complement on a bipartite semantic graph to pull apart input ambiguity from model instability. That decomposition is the main thing worth noting, and it produces regimes that map to different fixes: clarification for ambiguity, review or retraining for instability. On the benchmarks they run, including a clinical one, the combined scores beat Kernel Language Entropy at spotting failures, and the high-ambiguity/high-instability slice picks up 51 percent of errors while hitting only 25 percent of queries. That triage number is the practical hook for deployment settings where you want targeted interventions rather than blanket human oversight.

The graph-based approach is not a routine extension of earlier entropy work, so the specific construction counts as new for this task. The empirical results look honest enough on the surface, with concrete coverage stats and comparisons to a prior method.

The soft spot is the risk that the bipartite graph, built from the same model outputs, leaves residual dependence between the two scores. If the complement does not fully remove cross-terms tied to shared model artifacts or clinical SQL patterns, the claimed disentanglement could be partial rather than clean. The abstract and available details do not yet show error bars or full sensitivity checks on the graph construction, so the strength of the separation is still provisional.

This is aimed at people working on uncertainty estimation for structured generation, especially in clinical or safety-critical data access. A reader who cares about mapping uncertainty types to actions will get usable ideas even if the math needs more stress-testing. The work is coherent on its own terms and shows clear engagement with the problem, so it deserves a serious referee rather than a desk reject. I would send it out for review with requests for more detail on the graph normalization and independence checks.

Referee Report

1 major / 3 minor

Summary. The manuscript introduces CLUES, a framework that models LLM Text-to-SQL as a two-stage interpretations-to-answers process and decomposes semantic uncertainty into an ambiguity score (from input interpretations) and an instability score (via the Schur complement of a bipartite semantic graph matrix). It reports improved failure prediction over Kernel Language Entropy on AmbigQA/SituatedQA and a clinical Text-to-SQL benchmark, along with a high-ambiguity/high-instability regime that contains 51% of errors while covering 25% of queries for efficient triage.

Significance. If the decomposition is shown to be robust, the work provides a diagnostically useful advance over single-score uncertainty measures by enabling targeted interventions (clarification for ambiguity, review for instability). The concrete 51%/25% triage statistic offers practical value for high-stakes clinical deployment, and the two-stage modeling plus graph-based derivation represent a structured approach to uncertainty that could generalize beyond Text-to-SQL.

major comments (1)

§3.2 (Instability Score Computation): The claim that the Schur complement cleanly isolates model instability from input ambiguity requires that the bipartite semantic graph matrix construction and complement operation remove all cross-terms. Because the graph is built from the same LLM outputs used to generate interpretations on the evaluation queries, shared model artifacts in clinical SQL semantics may leave residual dependence; this risks the two scores sharing variance rather than being independent. This is load-bearing for the central disentanglement claim and the diagnostic regimes. An explicit orthogonality check (e.g., correlation between the two scores across queries or a controlled experiment with fixed ambiguity) is needed.
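The correlation check requested here could look like the sketch below, which computes a Spearman rank correlation between per-query ambiguity and instability scores using NumPy only. The function names are illustrative, and a correlation near zero would be evidence against shared variance rather than proof of independence.

```python
import numpy as np

def average_ranks(x):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        tied = x == value
        ranks[tied] = ranks[tied].mean()
    return ranks

def spearman(a, b):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])
```

A rank correlation is used in this sketch because the two scores need not be on comparable scales; reporting Pearson alongside it would be a reasonable addition.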

minor comments (3)

Abstract and §5: The reported improvements and 51%/25% coverage statistic are given without error bars, confidence intervals, or statistical significance tests; these should be added to allow assessment of robustness.
Evaluation Setup: Clarify how the 'known interpretations' for the clinical Text-to-SQL benchmark were obtained or annotated, as this directly affects the independence of the ambiguity score.
Notation: The bipartite semantic graph matrix and its Schur complement would benefit from an explicit small example or pseudocode to illustrate the decomposition for readers unfamiliar with the linear-algebra step.
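For the error-bar request in the first minor comment, a percentile bootstrap over queries is one simple option. The sketch below is an assumption about how such an interval could be obtained, not something the paper reports; the function name, `n_boot`, and `seed` are illustrative.

```python
import numpy as np

def bootstrap_error_share_ci(in_regime, is_error, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the share of errors falling inside a regime.

    in_regime and is_error are boolean arrays with one entry per query.
    """
    in_regime = np.asarray(in_regime, dtype=bool)
    is_error = np.asarray(is_error, dtype=bool)
    rng = np.random.default_rng(seed)
    n = len(is_error)
    shares = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample queries with replacement
        errs = is_error[idx]
        if errs.any():                     # skip resamples with no errors
            shares.append((in_regime[idx] & errs).sum() / errs.sum())
    lo, hi = np.quantile(shares, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)
```

The same resampling loop can wrap any other reported statistic, such as a difference in failure-prediction scores between two methods.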

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their valuable comments, which have helped us improve the clarity of our work on disentangling ambiguity and instability in LLM outputs for clinical Text-to-SQL. Below we provide a point-by-point response to the major comment.

Referee: §3.2 (Instability Score Computation): The claim that the Schur complement cleanly isolates model instability from input ambiguity requires that the bipartite semantic graph matrix construction and complement operation remove all cross-terms. Because the graph is built from the same LLM outputs used to generate interpretations on the evaluation queries, shared model artifacts in clinical SQL semantics may leave residual dependence; this risks the two scores sharing variance rather than being independent. This is load-bearing for the central disentanglement claim and the diagnostic regimes. An explicit orthogonality check (e.g., correlation between the two scores across queries or a controlled experiment with fixed ambiguity) is needed.

Authors: We thank the referee for this insightful observation on the potential for residual dependence between the scores. The construction of the bipartite semantic graph uses LLM-generated interpretations and answers to form a matrix where the Schur complement specifically computes the instability component by eliminating the variance attributable to the interpretation diversity (ambiguity). This is grounded in the block matrix properties that isolate conditional dependencies. That said, we recognize the value of an empirical verification to rule out any shared variance from model artifacts. Accordingly, we will revise the manuscript to include an orthogonality check: we will report the correlation between the ambiguity and instability scores across the evaluation queries in both the general and clinical benchmarks. Additionally, we will describe a controlled analysis holding interpretations fixed to isolate instability variations. These revisions will be incorporated in the updated version to bolster the disentanglement claim.

Revision: yes

Circularity Check

0 steps flagged

No circularity: Schur complement decomposition is an independent linear-algebra derivation on model outputs

full rationale

The paper models Text-to-SQL as a two-stage interpretations-to-answers process and computes the instability score via the Schur complement of a bipartite semantic graph matrix whose nodes and edges are populated from the LLM's generated interpretations and answers. This is a direct algebraic operation (standard block-matrix reduction) applied to an externally constructed adjacency matrix; it does not redefine the target quantity in terms of itself, fit parameters to the evaluation set and then relabel the fit as a prediction, or rely on a self-citation chain for its justification. The ambiguity score is likewise extracted from interpretation diversity without circular dependence on the instability term. Because the derivation remains a deterministic function of the observed outputs rather than a tautological renaming or post-hoc fit, the central claim of disentanglement is self-contained and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on modeling assumptions about LLM generation as a two-stage process and the validity of the bipartite graph representation for semantic uncertainty; no free parameters or invented entities are explicitly described in the abstract.

axioms (1)

Domain assumption: LLM Text-to-SQL generation can be accurately modeled as a two-stage process of interpretations followed by answers.
This decomposition is required to separate ambiguity from instability as stated in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem.

Paper passage: "The instability score is computed via the Schur complement of a bipartite semantic graph matrix... S = W_RR - W_RI (W_II + εI)^{-1} W_IR"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem.

Paper passage: "We define the ambiguity score as the entropy over the set of interpretations... H_I by applying the KLE framework to W_II"
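The quoted passage says the ambiguity score comes from applying the KLE framework to W_II. A hedged sketch of one such computation follows: it builds a heat kernel from the graph Laplacian of W_II, scales it to unit trace, and takes the von Neumann entropy. The heat-kernel choice, the diffusion time `t`, and the function name are assumptions; the paper's exact kernel and hyperparameters are not given on this page.

```python
import numpy as np

def graph_von_neumann_entropy(W, t=1.0):
    """Von Neumann entropy of the unit-trace heat kernel exp(-t*L) of a graph.

    W holds pairwise semantic similarities between interpretations;
    L = D - W is the combinatorial graph Laplacian.
    """
    W = np.asarray(W, dtype=float)
    W = 0.5 * (W + W.T)                  # enforce symmetry (returns a copy)
    np.fill_diagonal(W, 0.0)             # ignore self-similarity
    laplacian = np.diag(W.sum(axis=1)) - W
    eigvals = np.linalg.eigvalsh(laplacian)
    kernel_eigs = np.exp(-t * eigvals)   # eigenvalues of exp(-t*L)
    p = kernel_eigs / kernel_eigs.sum()  # unit-trace normalisation
    p = p[p > 0]                         # guard against log(0) after underflow
    return float(-(p * np.log(p)).sum())
```

Under these assumptions, interpretations with no semantic overlap give the maximum value log(n), while a tightly connected set of near-equivalent interpretations gives a value near zero.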

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reliable Automated Triage in Spanish Clinical Notes: A Hybrid Framework for Risk-Aware HIV Suspicion Identification
cs.CL · 2026-05 · unverdicted · novelty 5.0

A dual-verification selective classifier using conformal prediction and geometric distance vetoes achieves reliable HIV suspicion triage in Spanish clinical notes by isolating a high-trust subset of predictions.
