IndicDB -- Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages
Pith reviewed 2026-05-10 13:17 UTC · model grok-4.3
The pith
A new benchmark built from Indian government data shows LLMs drop 9 percent in Text-to-SQL accuracy on Indic languages versus English.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
IndicDB supplies 20 databases containing 237 tables drawn from Indian open-data platforms, converted into relational form with an average of 11.85 tables per database and join depths reaching six. A value-aware and join-enforced pipeline produces 15,617 tasks across seven language variants. When state-of-the-art models are tested on these tasks, performance falls by 9.00 percent from English to the Indic languages, with the drop traced to harder schema linking, greater structural ambiguity, and reduced access to external knowledge.
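As a quick arithmetic check on the figures above, the sketch below recomputes the tables-per-database average and contrasts two possible readings of the 9.00 percent drop: an absolute drop in percentage points versus a relative drop. The summary does not say which reading the paper intends, and the English accuracy of 60.0 is a made-up placeholder, not a reported result.

```python
# Figures quoted in the summary above.
NUM_DATABASES = 20
NUM_TABLES = 237
NUM_TASKS = 15_617
NUM_VARIANTS = 7
DROP = 9.00

avg_tables = NUM_TABLES / NUM_DATABASES
tasks_per_variant, remainder = divmod(NUM_TASKS, NUM_VARIANTS)

# Hypothetical English accuracy (NOT from the paper); it only serves to
# contrast the two readings of "9.00 percent".
english_acc = 60.0
indic_if_absolute = english_acc - DROP                 # drop in percentage points
indic_if_relative = english_acc * (1 - DROP / 100)     # drop relative to English

print(f"{avg_tables:.2f} tables per database")
print(f"{tasks_per_variant} tasks per variant, remainder {remainder}")
print(f"absolute reading: {indic_if_absolute:.1f}, relative reading: {indic_if_relative:.1f}")
```

The task count divides evenly by seven (2,231 per variant), which is consistent with, though not proof of, every question being rendered in all seven language variants.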
What carries the argument
The iterative three-agent framework (Architect, Auditor, Refiner) that turns denormalized government tables into high-density relational schemas while enforcing joins and calibrating task difficulty.
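The control flow of such a draft, audit, refine loop can be sketched as follows. This is a toy illustration only: in the paper the three agents are LLMs, whereas here they are rule-based stand-ins, and every name (`Table`, `architect`, `auditor`, `refiner`, `build_schema`), the 15-column limit, and the splitting rule are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MAX_COLUMNS = 15  # illustrative width limit, an assumption of this sketch


@dataclass
class Table:
    name: str
    columns: List[str]
    primary_key: Optional[str] = None


def architect(csv_columns: List[str]) -> List[Table]:
    """Stand-in for the Architect: draft one wide table with no key."""
    return [Table("FACT_RAW", list(csv_columns))]


def auditor(schema: List[Table]) -> List[Tuple[str, str]]:
    """Stand-in for the Auditor: report (table name, problem) pairs."""
    issues = []
    for table in schema:
        if len(table.columns) > MAX_COLUMNS:
            issues.append((table.name, "too_wide"))
        if table.primary_key is None:
            issues.append((table.name, "no_primary_key"))
    return issues


def refiner(schema: List[Table], issues: List[Tuple[str, str]]) -> List[Table]:
    """Stand-in for the Refiner: add surrogate keys and split wide tables."""
    too_wide = {name for name, kind in issues if kind == "too_wide"}
    refined = []
    for table in schema:
        key = table.primary_key or f"{table.name}_ID"
        cols = [c for c in table.columns if c != key]
        if table.name in too_wide:
            step = MAX_COLUMNS - 1  # leave room for the shared key column
            for start in range(0, len(cols), step):
                part = cols[start:start + step]
                refined.append(Table(f"{table.name}_{start // step + 1}", [key] + part, key))
        else:
            refined.append(Table(table.name, [key] + cols, key))
    return refined


def build_schema(csv_columns: List[str], max_rounds: int = 5) -> List[Table]:
    """Iterate audit and refinement until the auditor reports no issues."""
    schema = architect(csv_columns)
    for _ in range(max_rounds):
        issues = auditor(schema)
        if not issues:
            break
        schema = refiner(schema, issues)
    return schema


raw_columns = [f"COL_{i}" for i in range(40)]
schema = build_schema(raw_columns)
for table in schema:
    print(table.name, len(table.columns), table.primary_key)
```

Splitting by position is not real normalization; it only shows where an auditor's feedback would feed back into the next draft and why the loop needs a round limit.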
If this is right
- Schema linking remains the dominant source of error when queries move from English to Indic languages.
- Structural ambiguity increases measurably in Indic-language versions of the same database.
- Limited external knowledge hurts performance more on Indic queries than on English ones.
- The benchmark supplies a concrete test set for measuring progress on cross-lingual semantic parsing.
- Real administrative schemas expose complexity that simplified English benchmarks hide.
Where Pith is reading between the lines
- The same gap may appear in other semantic-parsing tasks that rely on schema understanding.
- Training data that includes Indic-language examples of database schemas could shrink the observed drop.
- Extending the benchmark to additional Indic languages or other government domains would test whether the gap generalizes.
- System builders may need language-specific schema-linking modules rather than relying on general multilingual pretraining.
Load-bearing premise
The three-agent framework produces tasks and schemas that faithfully represent real-world multilingual Text-to-SQL challenges without introducing generation artifacts or biases.
What would settle it
An independent collection of Indic-language Text-to-SQL tasks created without the three-agent pipeline that shows no accuracy difference from English would falsify the reported gap.
Original abstract
While Large Language Models (LLMs) have significantly advanced Text-to-SQL performance, existing benchmarks predominantly focus on Western contexts and simplified schemas, leaving a gap in real-world, non-Western applications. We present IndicDB, a multilingual Text-to-SQL benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages. The relational schemas are sourced from open-data platforms, including the National Data and Analytics Platform (NDAP) and the India Data Portal (IDP), ensuring realistic administrative data complexity. IndicDB comprises 20 databases across 237 tables. To convert denormalized government data into rich relational structures, we employ an iterative three-agent framework (Architect, Auditor, Refiner) to ensure structural rigor and high relational density (11.85 tables per database; join depths up to six). Our pipeline is value-aware, difficulty-calibrated, and join-enforced, generating 15,617 tasks across English, Hindi, and five Indic languages. We evaluate cross-lingual semantic parsing performance of state-of-the-art models (DeepSeek v3.2, MiniMax 2.7, LLaMA 3.3, Qwen3) across seven linguistic variants. Results show a 9.00% performance drop from English to Indic languages, revealing an "Indic Gap" driven by harder schema linking, increased structural ambiguity, and limited external knowledge. IndicDB serves as a rigorous benchmark for multilingual Text-to-SQL. Code and data: https://anonymous.4open.science/r/multilingualText2Sql-Indic--DDCC/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces IndicDB, a multilingual Text-to-SQL benchmark with 15,617 tasks spanning English, Hindi, and five other Indic languages. It derives 20 realistic relational databases (237 tables total, avg. 11.85 tables/DB, joins to depth 6) from NDAP and IDP open data using an iterative three-agent LLM pipeline (Architect-Auditor-Refiner) described as value-aware, difficulty-calibrated, and join-enforced. Evaluation of models including DeepSeek v3.2, MiniMax 2.7, LLaMA 3.3, and Qwen3 reports a 9.00% average performance drop from English to Indic languages, interpreted as an 'Indic Gap' caused by harder schema linking, increased structural ambiguity, and limited external knowledge.
Significance. If the generated tasks are free of language-specific artifacts, IndicDB would be a valuable addition to Text-to-SQL resources by providing the first large-scale benchmark on real Indian administrative data in multiple Indic languages. The scale, relational complexity, and grounding in public government datasets distinguish it from prior synthetic or English-centric benchmarks and could usefully guide targeted improvements in cross-lingual semantic parsing.
major comments (2)
- The central claim of a 9.00% 'Indic Gap' is load-bearing on the assumption that the three-agent (Architect-Auditor-Refiner) pipeline produces tasks of equivalent difficulty and structure across languages. Because the agents are LLMs whose English performance substantially exceeds their Indic performance, systematic differences in query formulation, difficulty calibration, or join enforcement could artifactually inflate the measured gap. The manuscript provides no human validation, cross-lingual ablation, or quantitative comparison of task statistics (e.g., join depth distributions, schema-linking complexity) between English and Indic variants to rule out generation bias.
- The reported 9.00% performance drop is presented without error bars, confidence intervals, or statistical significance tests. The abstract and evaluation description also omit full per-model, per-language result tables or breakdowns, preventing assessment of whether the gap is consistent across models or driven by particular languages or query types.
minor comments (2)
- Clarify the exact set of seven linguistic variants evaluated (the abstract lists 'English, Hindi, and five Indic languages' but does not enumerate the five additional languages).
- The anonymous code/data link limits reproducibility assessment; consider providing a non-anonymous repository or detailed data-release plan in the camera-ready version.
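The first major comment asks for a quantitative comparison of task statistics across languages. One way such a check might look is sketched below: count JOIN keywords in each gold SQL query as a crude proxy for join depth, build a distribution per language, and measure how far apart two distributions are with total variation distance. The function names and the toy queries are assumptions for illustration; nothing here is taken from the IndicDB release.

```python
import re
from collections import Counter
from typing import Dict, List

JOIN_RE = re.compile(r"\bJOIN\b", re.IGNORECASE)


def join_count(sql: str) -> int:
    """Count JOIN keywords; a rough proxy that ignores subquery nesting."""
    return len(JOIN_RE.findall(sql))


def join_distribution(queries: List[str]) -> Dict[int, float]:
    """Fraction of queries at each join count."""
    counts = Counter(join_count(q) for q in queries)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def total_variation(p: Dict[int, float], q: Dict[int, float]) -> float:
    """Total variation distance: 0 means identical, 1 means disjoint."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Toy gold queries (hypothetical), standing in for two language variants.
one_join = "SELECT * FROM a JOIN b ON a.id = b.a_id"
two_joins = one_join + " JOIN c ON b.id = c.b_id"
english_sql = ["SELECT * FROM a", one_join, two_joins, two_joins]
hindi_sql = ["SELECT * FROM a", "SELECT * FROM a", one_join, two_joins]

tv = total_variation(join_distribution(english_sql), join_distribution(hindi_sql))
print(f"total variation distance: {tv:.2f}")
```

A distance near zero for every English-versus-Indic pair would support the claim that the variants are structurally matched; a large distance would point toward the generation bias the referee describes.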
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and commit to revisions that will strengthen the empirical support for our claims.
- Referee: The central claim of a 9.00% 'Indic Gap' is load-bearing on the assumption that the three-agent (Architect-Auditor-Refiner) pipeline produces tasks of equivalent difficulty and structure across languages. Because the agents are LLMs whose English performance substantially exceeds their Indic performance, systematic differences in query formulation, difficulty calibration, or join enforcement could artifactually inflate the measured gap. The manuscript provides no human validation, cross-lingual ablation, or quantitative comparison of task statistics (e.g., join depth distributions, schema-linking complexity) between English and Indic variants to rule out generation bias.
Authors: We acknowledge that an LLM-based generation pipeline carries a risk of language-dependent artifacts. The three-agent design attempts to mitigate this by using language-agnostic structural constraints (value-aware schema construction, validity auditing, and join-enforced refinement), but we agree this is insufficient without explicit verification. In the revised manuscript we will add: (i) side-by-side quantitative distributions of join depth, number of tables per query, predicate count, and schema-linking complexity metrics for English versus each Indic language; (ii) a cross-lingual ablation in which English-generated queries are machine-translated and re-run through the Auditor-Refiner stages for direct comparison; and (iii) a human validation study on a stratified sample of 300 tasks (50 per language) in which native speakers rate difficulty, naturalness, and structural equivalence. These additions will allow readers to assess whether the observed gap is attributable to model limitations rather than data generation bias. revision: yes
- Referee: The reported 9.00% performance drop is presented without error bars, confidence intervals, or statistical significance tests. The abstract and evaluation description also omit full per-model, per-language result tables or breakdowns, preventing assessment of whether the gap is consistent across models or driven by particular languages or query types.
Authors: We agree that the current presentation lacks necessary statistical detail and granularity. The revised manuscript will include: error bars and 95% confidence intervals on all aggregate and per-language scores; paired statistical significance tests (Wilcoxon signed-rank and McNemar tests) for the English-to-Indic gap both overall and per model; and expanded evaluation tables reporting exact accuracy for every model-language pair together with breakdowns by query type (join depth 0-2, 3-4, 5-6; number of tables; presence of value linking). The abstract will be updated to note that detailed per-model and per-language results, including statistical tests, appear in the main text and appendix. revision: yes
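A minimal sketch of two of the promised statistics follows: a paired percentile-bootstrap confidence interval for the English-to-Indic accuracy gap, and an exact two-sided McNemar test on the discordant pairs. It assumes each task has a correct/incorrect outcome in both languages so that results can be paired one-to-one, which the summary does not confirm. The data are synthetic and the function names are hypothetical; the Wilcoxon test the authors also mention is not shown.

```python
import random
from math import comb
from typing import List, Tuple


def mcnemar_exact(english: List[bool], indic: List[bool]) -> float:
    """Exact two-sided McNemar p-value from the discordant pairs."""
    b = sum(e and not i for e, i in zip(english, indic))  # English right, Indic wrong
    c = sum(i and not e for e, i in zip(english, indic))  # Indic right, English wrong
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, j) for j in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bootstrap_gap_ci(english: List[bool], indic: List[bool],
                     n_resamples: int = 2000, alpha: float = 0.05,
                     seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for (English accuracy - Indic accuracy)."""
    rng = random.Random(seed)
    diffs = [int(e) - int(i) for e, i in zip(english, indic)]
    n = len(diffs)
    # Resample tasks with replacement, keeping each English/Indic pair together.
    gaps = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic outcomes for 100 paired tasks (NOT results from the paper).
english = [True] * 70 + [False] * 30
indic = [True] * 55 + [False] * 15 + [True] * 6 + [False] * 24

gap = sum(english) / len(english) - sum(indic) / len(indic)
p_value = mcnemar_exact(english, indic)
lower, upper = bootstrap_gap_ci(english, indic)
print(f"gap = {gap:.2f}, McNemar p = {p_value:.4f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Resampling whole task pairs, rather than each language independently, keeps the correlation between a task's English and Indic outcomes, which is why a paired design usually gives a tighter interval for the gap.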
Circularity Check
No circularity: empirical benchmark construction and external evaluation
The paper constructs IndicDB by applying an iterative three-agent (Architect-Auditor-Refiner) pipeline to external NDAP and IDP government data sources, then evaluates off-the-shelf LLMs (DeepSeek, LLaMA, Qwen, etc.) on the resulting 15,617 tasks. The reported 9% English-to-Indic drop and 'Indic Gap' attribution are direct empirical measurements, not derived quantities. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the described chain; the methodology is self-contained against external models and real-world data without reducing the central claims to the generation process by construction.