CLARITY: A Framework and Benchmark for Conversational Language Ambiguity and Unanswerability in Interactive NL2SQL Systems
Pith reviewed 2026-05-08 11:41 UTC · model grok-4.3
The pith
Leading NL2SQL systems suffer significant performance drops when queries contain multiple sources of ambiguity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Clarity is a framework that automatically generates an NL2SQL benchmark containing multi-faceted ambiguities and diverse user behaviors across single- and multi-turn settings. Through a constraint-driven pipeline, it transforms executable SQL into ambiguous queries augmented with grounded conversational continuations and schema-level metadata. Empirical evaluation on Spider and BIRD shows that leading NL2SQL systems suffer significant performance degradation under multi-faceted ambiguity: they often detect the ambiguity but struggle to localize and resolve its underlying schema-level sources.
What carries the argument
The Clarity framework, which uses a constraint-driven pipeline to transform executable SQL queries into ambiguous natural language versions with added conversational continuations and schema metadata.
Load-bearing premise
The automatically generated multi-faceted ambiguities and conversational continuations represent realistic failure modes that occur in real interactive industry NL2SQL deployments.
What would settle it
Comparing the ambiguity patterns and system failure modes in the generated benchmark against actual user interaction logs from deployed NL2SQL applications in industry settings.
Original abstract
NL2SQL systems deployed in industry settings often encounter ambiguous or unanswerable queries, particularly in interactive scenarios with incomplete user clarification. Existing benchmarks typically assume a single source of ambiguity and rely on user interaction for resolution, overlooking realistic failure modes. We introduce Clarity, a framework for automatically generating an NL2SQL benchmark with multi-faceted ambiguities and diverse user behaviors across both single- and multi-turn settings. Using a constraint-driven pipeline, Clarity transforms executable SQL into ambiguous queries, augmented with grounded conversational continuations and schema-level metadata. Empirical evaluation on Spider and BIRD shows that leading NL2SQL systems, including those based on strong LLMs, suffer significant performance degradation under multi-faceted ambiguity. While these systems often detect ambiguity, they struggle to accurately localize and resolve the underlying schema-level sources. Our results highlight the need for more robust ambiguity detection and resolution in industry-grade NL2SQL systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents CLARITY, a framework for generating benchmarks for conversational NL2SQL with multi-faceted ambiguities and unanswerability. It uses a constraint-driven pipeline to create ambiguous natural language queries from executable SQL, along with grounded conversational continuations and schema metadata. Evaluation on Spider and BIRD datasets demonstrates that state-of-the-art NL2SQL systems, including LLM-based ones, experience significant performance drops when facing these ambiguities, often detecting them but failing to localize schema-level sources accurately.
Significance. If the generated benchmark is shown to be representative of real interactive scenarios, this work would be significant for the NL2SQL field by exposing limitations in current systems' ambiguity handling and motivating more robust designs for industry deployments. The automatic, scalable generation approach is a strength that could support future benchmark expansion.
major comments (2)
- [§3] §3 (Framework and Benchmark Generation): The constraint-driven pipeline is described as transforming executable SQL into ambiguous queries with multi-faceted ambiguities and grounded continuations, but no validation is reported (e.g., human evaluation, inter-annotator agreement, or distributional comparison against real industry NL2SQL logs) to confirm that the synthetic cases avoid artifacts and match realistic failure modes. This is load-bearing for the central claim of performance degradation and poor localization.
- [§5] §5 (Empirical Evaluation): The results claim significant degradation on Spider and BIRD under multi-faceted ambiguity, yet the manuscript provides no details on specific error metrics, controls for artificiality of generated queries, breakdown by ambiguity type, or statistical tests. This makes it difficult to assess whether the localization/resolution failures are general or benchmark-specific.
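To make the requested breakdown concrete, the following is a minimal sketch of execution accuracy reported per ambiguity type. It is not the paper's evaluation code: the function names, the `(ambiguity_type, gold_sql, predicted_sql)` tuple layout, the type labels, and the toy `my_data` table contents are all assumptions made for illustration, and result sets are compared order-insensitively.

```python
import sqlite3
from collections import defaultdict


def run(conn, sql):
    """Execute a query; return its rows sorted for comparison, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sorting ignores ORDER BY differences (an assumption of this sketch).
    return sorted(rows, key=repr)


def execution_accuracy_by_type(conn, examples):
    """examples: iterable of (ambiguity_type, gold_sql, predicted_sql) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for amb_type, gold_sql, pred_sql in examples:
        totals[amb_type] += 1
        gold, pred = run(conn, gold_sql), run(conn, pred_sql)
        if gold is not None and gold == pred:
            hits[amb_type] += 1
    return {t: hits[t] / totals[t] for t in totals}


# Toy database (hypothetical contents).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE my_data ("Monthly Orders" INTEGER, "Yearly Orders" INTEGER)')
conn.executemany("INSERT INTO my_data VALUES (?, ?)", [(3, 30), (5, 60)])

examples = [
    # Prediction resolves the ambiguous term to the intended column.
    ("lexical", 'SELECT sum("Monthly Orders") FROM my_data',
     'SELECT sum("Monthly Orders") FROM my_data'),
    # Prediction picks a lexically similar but wrong column.
    ("lexical", 'SELECT sum("Monthly Orders") FROM my_data',
     'SELECT sum("Yearly Orders") FROM my_data'),
    # Prediction fails to execute at all.
    ("semantic", "SELECT count(*) FROM my_data",
     "SELECT count(*) FROM no_such_table"),
]
scores = execution_accuracy_by_type(conn, examples)
print(scores)
```

Reporting one score per ambiguity type, rather than a single aggregate, is what would let a reader see whether localization failures concentrate in a particular category.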
minor comments (1)
- The abstract could more precisely quantify the scale of the generated benchmark (number of examples, ambiguity types) to help readers assess the evaluation's scope.
Simulated Author's Rebuttal
Thank you for your thoughtful and constructive review of our manuscript on the CLARITY framework. We appreciate the feedback highlighting the need for stronger validation of the benchmark generation process and more granular reporting in the empirical evaluation. We address each major comment below and commit to revisions that will enhance the rigor and clarity of the work without altering its core contributions.
Point-by-point responses
-
Referee: [§3] §3 (Framework and Benchmark Generation): The constraint-driven pipeline is described as transforming executable SQL into ambiguous queries with multi-faceted ambiguities and grounded continuations, but no validation is reported (e.g., human evaluation, inter-annotator agreement, or distributional comparison against real industry NL2SQL logs) to confirm that the synthetic cases avoid artifacts and match realistic failure modes. This is load-bearing for the central claim of performance degradation and poor localization.
Authors: We agree that the absence of explicit validation for the generated benchmark is a limitation that weakens support for our central claims. The constraint-driven pipeline was designed to systematically derive multi-faceted ambiguities from executable SQL and schema metadata in a way that mirrors observed failure modes in interactive NL2SQL, but the manuscript does not report human evaluation, inter-annotator agreement, or comparisons to real industry logs. In the revised manuscript, we will add a dedicated validation subsection in §3, including a human study on a representative sample of generated queries (with inter-annotator agreement metrics) and qualitative analysis of realism. Where feasible, we will also incorporate distributional comparisons using publicly available NL2SQL interaction datasets. These additions will directly address the load-bearing concern.
Revision: yes
-
Referee: [§5] §5 (Empirical Evaluation): The results claim significant degradation on Spider and BIRD under multi-faceted ambiguity, yet the manuscript provides no details on specific error metrics, controls for artificiality of generated queries, breakdown by ambiguity type, or statistical tests. This makes it difficult to assess whether the localization/resolution failures are general or benchmark-specific.
Authors: We concur that the empirical evaluation section would be strengthened by additional details and controls. The current manuscript reports aggregate performance degradation and localization issues across Spider and BIRD, but lacks breakdowns, specific metrics beyond high-level claims, explicit controls for artificiality, and statistical tests. In the revision, we will expand §5 to include: (1) breakdowns by ambiguity type (e.g., schema-level, value-level, and multi-faceted combinations); (2) specific metrics such as exact match, execution accuracy, and ambiguity detection F1; (3) controls comparing results on generated ambiguous queries versus their original unambiguous Spider/BIRD counterparts; and (4) statistical significance tests (e.g., paired t-tests or McNemar's test) for the observed degradations. This will clarify the generality of the localization and resolution failures.
Revision: yes
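Two of the statistics named in these responses can be sketched briefly: Cohen's kappa for inter-annotator agreement (§3) and an exact McNemar test on paired correctness (§5). This is an illustrative sketch using only the Python standard library. The function names, label values, and toy data are assumptions, and nothing here comes from the manuscript.

```python
from collections import Counter
from math import comb


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


def mcnemar_exact(correct_original, correct_ambiguous):
    """Two-sided exact McNemar p-value for paired 0/1 outcomes."""
    pairs = list(zip(correct_original, correct_ambiguous))
    b = sum(1 for o, a in pairs if o and not a)  # right on original only
    c = sum(1 for o, a in pairs if a and not o)  # right on ambiguous only
    n = b + c
    if n == 0:
        return 1.0
    # Under the null, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical annotations of four generated queries.
annotator_1 = ["valid", "valid", "invalid", "valid"]
annotator_2 = ["valid", "invalid", "invalid", "valid"]
kappa = cohen_kappa(annotator_1, annotator_2)

# Hypothetical per-query correctness of one system on original queries
# and on their ambiguous counterparts (1 = correct, 0 = incorrect).
on_original = [1, 1, 1, 1, 1, 1, 0, 1]
on_ambiguous = [0, 0, 0, 0, 0, 1, 0, 1]
p_value = mcnemar_exact(on_original, on_ambiguous)
print(kappa, p_value)
```

McNemar's test fits the paired control in item (3) because each ambiguous query has an unambiguous counterpart, so only the discordant pairs carry information about degradation.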
Circularity Check
No circularity: benchmark generation and evaluation are independent
The paper describes a constraint-driven pipeline that transforms executable SQL into ambiguous queries with grounded continuations, then reports empirical results on Spider and BIRD. No equations, fitted parameters, or central claims reduce by construction to the paper's own inputs, self-citations, or prior author ansatzes. The performance degradation findings are direct observations on the generated test cases rather than tautological predictions, and the framework is presented as a standalone contribution without load-bearing self-referential derivations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Existing benchmarks assume a single source of ambiguity and rely on user interaction for resolution
invented entities (1)
- Clarity framework: no independent evidence
Reference graph
Works this paper leans on
-
[1]
nvbench 2.0: Resolving ambiguity in text-to-visualization through stepwise reasoning. arXiv preprint arXiv:2503.12880. Mihai Nadăș, Laura Dioșan, and Andreea Tomescu
-
[2]
Synthetic data generation using large language models: Advances in text and code. IEEE Access. Simone Papicchio, Luca Cagliero, and Paolo Papotti
-
[3]
Squab: Evaluating llm robustness to ambiguous and unanswerable questions in semantic parsing. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 17937–17957. Mohammadreza Pourreza and Davood Rafiei. 2023. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. In Advances in Neural Info...
-
[4]
Know what I don’t know: Handling ambiguous and unknown questions for text-to-sql. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5701–5714. Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, Zilin Zhang, and Dragomir Radev. 2018. Spider: A large-scale hu...
-
[5]
Sphinteract: Resolving ambiguities in nl2sql through user interaction. Proceedings of the VLDB Endowment, 18(4):1145–1158.
A The CLARITY Framework - Appendix
This appendix presents a taxonomy of A/U in CLARITY. We define core A/U concepts (Table 8), characterize user clarification behaviors in multi-turn interactions (Table 9), provide representative conver...
-
[6]
**Group Selection**: Identify groups of columns that are lexically similar to the given column of interest - Identify the columns that have similar names and writing style or structure to the column of interest and group them. - The column of interest and selected columns must share common tokens, words or segments. - If there are many column names which ...
-
[7]
**Term Generation**: Generate an ambiguous term for the column of interest that can likely be asked by a human database manager and cause ambiguity with the selected columns group using the following rules: - The ambiguous term should be lexically similar to column of interest in terms of wording, length, and style. - The ambiguous term should NOT be an e...
-
[8]
**Constraints**: - The ambiguous term and selected group must be specific to the column of interest, i.e., the ambiguous term is generated for the column of interest which causes ambiguity with the selected columns group. - For the ambiguous term, DO NOT use tokens, words or segments from columns of the selected group that do not appear in the column of i...
-
[9]
**Format**: Strictly generate the response using the given output format. DO NOT generate any other content or text. ### **Examples** Example 1: Column of Interest: "Monthly Orders" Given List of Similar Columns: ["Total Orders", "Yearly Orders", "Order Dispatch Date" , "Order Invoice ID"] History: None Selected Group: ["Total Orders", "Yearly Orders"] Am...
-
[10]
**Analyze**: * Compare the `Given Term` with each column of the `Given Column List` using the given `Ambiguity` definition. * Focus on the conditions of the ambiguity criterion to identify a valid ambiguous term
-
[11]
**Decide**: Outcome is `"valid"` if the term is ambiguous. Otherwise, it is `"invalid"`
-
[12]
**Explain**: Provide a brief reason for your decision
-
[13]
**Constraints**: * Base your reason strictly on the provided definitions. * The reason should be specific to the given term and list of columns. * Do not use any external knowledge or make assumptions
-
[14]
**Format**: Structure your entire response according to the `Output Format` instructions
### **Inputs**
Given Term: {term}
Given List of Columns: {dataset_columns}
### **Outputs**
**Your response must follow these instructions**: {format_instructions}
NL Query Generator Prompt
You are a SQL expert specializing in natural language (NL)-to-SQL translation a...
-
[15]
**Natural Language Query Generation**: - Using the given SQL query, write a natural language query that would generate the given SQL query - Strictly use the column names mentioned in the SQL query and its context to generate the natural language query. Do not make any assumptions - All the column names mentioned in the SQL query must be explicitly mentio...
-
[16]
**Ambiguous Query Generation**: - You are given a list of column names and corresponding ambiguous terms. Use these to generate an ambiguous query where the given column is replaced with the ambiguous term. - strictly use the given column name to replace it with the corresponding ambiguous term for the ambiguous query. - Use the generated natural language...
-
[17]
**Constraints**: - The generated natural language query and its ambiguous variant should correspond to the same underlying SQL query. The aim of the queries should not change. - The generated natural language query should be written in a human-like style and should not use the exact column names format found in the SQL query. Use the provided examples as ...
-
[18]
**Format**: - Strictly generate response requested in the output instructions. - DO NOT GENERATE any other supplementary explanation or description. Strictly generate output as mentioned in the output format ### **Examples** Example 1: Given SQL Query: 'SELECT count("Monthly Orders"), "Employee ID" FROM my_data GROUP BY "Employee ID" ORDER BY Employee Nam...
-
[19]
**Consistency Assessment**: The ambiguous query should be compared with the original SQL query for assessment. - The ambiguous term(s) and corresponding true column(s) is given that resolves the ambiguity in the given query. - When the ambiguous term(s) is replaced by the true column(s), the resulting query should be consistent with the underlying SQL que...
-
[20]
**Response Generation**: After replacing the ambiguous term(s) with true column(s) in the given ambiguous query, if the resulting query corresponds to the underlying SQL query then the outcome is 'valid'. Otherwise, it is 'invalid'. - The valid assessment must address the two consistency conditions. - Provide a reason for the assessment as well
-
[21]
**Constraints**: - Base your reason strictly on the provided instructions. Response should be specific to given queries, ambiguous term and true column information. - Do not use any external knowledge or make assumptions. - Refrain from providing a SQL language-based response and instead, provide a natural language response
-
[22]
**Response Format**: Strictly generate the response using the given output format. DO NOT generate any other content or text. ### **Examples** Example 1: Ambiguous Natural Language Query: 'show count of orders by employee sorted on employee name' Ambiguous terms: ['orders','employee'] True Columns: ['Monthly Orders','Employee ID'] Underlying SQL Query: SE...
-
[23]
**Lexical Column Ambiguity**: The query includes tokens or terms that refer to a column, but there is no exact match with any column in the given schema. However, the term or token is lexically similar to two or more columns of the schema, making it unclear which column the query is referring to. - Lexical similarity means that the term shares similar nam...
-
[24]
**Semantic Column Ambiguity**: The query includes tokens or terms that refer to a column, but there is no exact match with any column in the schema. However, the term or token is semantically similar to two or more columns of the schema, making it unclear which column the query is referring to. - Semantic Similarity means that the term represents highe...
-
[25]
**Column Confusion**: The query contains tokens or terms that cannot be mapped to any columns in the schema. The term or token is neither lexically nor semantically similar to any columns in the schema. The column does not exist in the schema
-
[26]
**Unambiguous**: The query contains tokens or terms referring to a column, with each term being mapped to a single column in the schema. Unambiguous queries can be translated to SQL query without any human intervention ### **Instructions**: You are given a natural language query that may belong to one of the given categories. Your task is to identify term...
-
[27]
Analyse the natural language query to identify the terms that refer to a column in the schema, taking into account the given **Type of Categories** - Schema can contain multiple tables, compare the terms against each table - Ignore the case (uppercase or lowercase), singular/plural forms, and minor variations during the comparison. For example: -'Customer Count',...
-
[28]
Identify the list of terms that refer to a column but do not exactly match any column names in the schema. - There could be more than one term whose exact match does not exist - If an exact match with the schema columns exists for the identified terms, then use 'None Found' for the response
-
[29]
Provide a reason for the assessment. The reason should be specific to the given schema. **Format**: Strictly generate the response using the given output format, recording terms and reason separately. DO NOT generate any other content or text. ### **Inputs** Database Schema: {db_schema} Natural Language Question: {ac_query} ### **Outputs** {format_instruct...
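The category definitions and the group-selection rule quoted in the prompts above can be illustrated with a small token-overlap heuristic. This is only a sketch under stated assumptions: the prompts delegate these judgments to an LLM, whereas the hypothetical helpers below (`tokens`, `lexical_group`, `classify_term`) approximate "lexically similar" as sharing at least one whitespace-separated token, treat a single lexical match as unambiguous, and cannot separate Semantic Column Ambiguity from Column Confusion.

```python
def tokens(name):
    """Lowercased whitespace tokens of a column name or query term."""
    return set(name.lower().split())


def lexical_group(term, columns):
    """Columns sharing at least one token with the term.

    A crude stand-in for lexical similarity (assumption of this sketch).
    """
    target = tokens(term)
    return [c for c in columns if tokens(c) & target]


def classify_term(term, columns):
    """Assign a query term to one of the categories quoted above."""
    if any(term.lower() == c.lower() for c in columns):
        return "Unambiguous"  # exact match, ignoring case
    group = lexical_group(term, columns)
    if len(group) >= 2:
        return "Lexical Column Ambiguity"
    if len(group) == 1:
        return "Unambiguous"  # assumption: one lexical match is resolvable
    # Token overlap says nothing about meaning, so these two stay merged.
    return "Semantic Column Ambiguity or Column Confusion"


# Column names taken from the prompt example quoted above.
similar = ["Total Orders", "Yearly Orders", "Order Dispatch Date", "Order Invoice ID"]
group = lexical_group("Monthly Orders", similar)

schema = ["Monthly Orders", "Employee ID"] + similar
labels = {t: classify_term(t, schema) for t in ["orders", "employee id", "issue"]}
print(group)
print(labels)
```

On the prompt's own example, this heuristic selects the same group as the listed answer, because "Order" and "Orders" are treated as different tokens. A segment-level or stemmed comparison would select a larger group.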