
arxiv: 2604.13686 · v1 · submitted 2026-04-15 · 💻 cs.CL · cs.AI · cs.DB

IndicDB -- Benchmarking Multilingual Text-to-SQL Capabilities in Indian Languages

Pith reviewed 2026-05-10 13:17 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.DB
keywords multilingual text-to-sql · indic languages · benchmark · cross-lingual semantic parsing · schema linking · llm evaluation · indian administrative data

The pith

A new benchmark built from Indian government data shows LLMs drop 9 percent in Text-to-SQL accuracy on Indic languages versus English.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates IndicDB to test Text-to-SQL on realistic Indian administrative schemas across English, Hindi, and five other Indic languages. It converts raw open data into dense relational databases using an iterative agent process and then generates 15,617 calibrated query tasks. Evaluation of several current LLMs reveals a consistent 9 percent accuracy gap between English and Indic versions, linked to difficulties with schema linking, structural ambiguity, and missing external knowledge. A sympathetic reader would care because most existing Text-to-SQL benchmarks use simplified Western data, so this gap indicates that practical multilingual systems still face unaddressed barriers in non-English settings.

Core claim

IndicDB supplies 20 databases containing 237 tables drawn from Indian open-data platforms, converted into relational form with an average of 11.85 tables per database and join depths reaching six. A value-aware and join-enforced pipeline produces 15,617 tasks across seven language variants. When state-of-the-art models are tested on these tasks, performance falls by 9.00 percent from English to the Indic languages, with the drop traced to harder schema linking, greater structural ambiguity, and reduced access to external knowledge.
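
The performance figures above are execution-based scores, so it may help to see how such a score is typically computed. The sketch below is a minimal illustration, not the paper's evaluation harness: it uses SQLite from the Python standard library rather than the paper's database engine, the table names and rows are invented toy data, and `execution_accuracy` is a hypothetical helper name. A prediction counts as correct when it executes and returns the same multiset of rows as the gold query.

```python
import sqlite3
from collections import Counter


def run(conn, sql):
    """Execute a query; return its rows as an order-insensitive multiset, or None on error."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) SQL pairs whose result multisets match."""
    hits = 0
    for predicted, gold in pairs:
        gold_rows = run(conn, gold)
        if gold_rows is not None and run(conn, predicted) == gold_rows:
            hits += 1
    return hits / len(pairs)


# Toy database; names and values are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DIM_STATE (STATE_ID INTEGER PRIMARY KEY, STATE_NAME TEXT);
CREATE TABLE FACT_CROP (STATE_ID INTEGER, CROP TEXT, CROP_YIELD REAL);
INSERT INTO DIM_STATE VALUES (1, 'Chhattisgarh'), (2, 'Kerala');
INSERT INTO FACT_CROP VALUES (1, 'maize', 1.5), (1, 'barley', 0.9), (2, 'maize', 2.1);
""")

gold = (
    "SELECT f.CROP, f.CROP_YIELD FROM FACT_CROP f "
    "JOIN DIM_STATE s ON f.STATE_ID = s.STATE_ID "
    "WHERE s.STATE_NAME = 'Chhattisgarh'"
)
pairs = [
    # Different SQL, same rows: counted as correct.
    ("SELECT CROP, CROP_YIELD FROM FACT_CROP WHERE STATE_ID = 1 ORDER BY CROP", gold),
    # Missing filter: returns an extra row, counted as wrong.
    ("SELECT CROP, CROP_YIELD FROM FACT_CROP", gold),
    # Invalid table: execution error, counted as wrong.
    ("SELECT CROP FROM MISSING_TABLE", gold),
]
score = execution_accuracy(conn, pairs)
print(f"execution accuracy: {score:.3f}")
```

Comparing multisets ignores row order, which is a common convention but can over-credit queries where the question demands a specific ordering; a stricter harness would compare ordered rows whenever the gold query contains ORDER BY.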

What carries the argument

The iterative three-agent framework (Architect, Auditor, Refiner) that turns denormalized government tables into high-density relational schemas while enforcing joins and calibrating task difficulty.
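
The control flow of such an Architect, Auditor, Refiner loop can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' implementation: in the paper the three roles are LLM agents, whereas here the Architect and Refiner are plain hypothetical callables (`toy_architect`, `toy_refiner`), the Auditor is a small rule-based check, and the 15-column width limit and three-round cap are illustrative values.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

MAX_COLUMNS = 15  # illustrative width limit, an assumption of this sketch


@dataclass
class Table:
    name: str
    columns: List[str]
    primary_key: Optional[str] = None
    foreign_keys: Dict[str, str] = field(default_factory=dict)  # column -> referenced table


def audit(schema):
    """Auditor stand-in: list structural problems; an empty list means the schema passes."""
    names = {t.name for t in schema}
    issues = []
    for t in schema:
        if len(t.columns) > MAX_COLUMNS:
            issues.append(f"{t.name}: too wide ({len(t.columns)} columns)")
        if t.primary_key is None:
            issues.append(f"{t.name}: missing primary key")
        for col, ref in t.foreign_keys.items():
            if ref not in names:
                issues.append(f"{t.name}.{col}: unknown table {ref}")
    return issues


def build_schema(architect, refiner, max_rounds=3):
    """Draft a schema, then alternate audit and refinement until it passes or rounds run out."""
    schema = architect()
    for _ in range(max_rounds):
        issues = audit(schema)
        if not issues:
            break
        schema = refiner(schema, issues)
    return schema, audit(schema)


def toy_architect():
    """Hypothetical Architect: one denormalized 20-column table with no key."""
    return [Table("FACT_WIDE", [f"C{i}" for i in range(20)])]


def toy_refiner(schema, issues):
    """Hypothetical Refiner: split wide tables in two and add keys linking them."""
    refined = []
    for t in schema:
        if len(t.columns) > MAX_COLUMNS:
            half = len(t.columns) // 2
            dim = Table("DIM_" + t.name, ["ID"] + t.columns[half:], primary_key="ID")
            fact = Table(
                t.name,
                ["ID"] + t.columns[:half] + ["DIM_ID"],
                primary_key="ID",
                foreign_keys={"DIM_ID": dim.name},
            )
            refined += [dim, fact]
        else:
            refined.append(Table(t.name, t.columns, t.primary_key or t.columns[0], t.foreign_keys))
    return refined


schema, issues = build_schema(toy_architect, toy_refiner)
print([t.name for t in schema], issues)
```

Returning the final audit result alongside the schema makes it explicit when the loop stopped because the round budget ran out rather than because the schema passed.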

If this is right

  • Schema linking remains the dominant source of error when queries move from English to Indic languages.
  • Structural ambiguity increases measurably in Indic-language versions of the same database.
  • Limited external knowledge hurts performance more on Indic queries than on English ones.
  • The benchmark supplies a concrete test set for measuring progress on cross-lingual semantic parsing.
  • Real administrative schemas expose complexity that simplified English benchmarks hide.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gap may appear in other semantic-parsing tasks that rely on schema understanding.
  • Training data that includes Indic-language examples of database schemas could shrink the observed drop.
  • Extending the benchmark to additional Indic languages or other government domains would test whether the gap generalizes.
  • System builders may need language-specific schema-linking modules rather than relying on general multilingual pretraining.

Load-bearing premise

The three-agent framework produces tasks and schemas that faithfully represent real-world multilingual Text-to-SQL challenges without introducing generation artifacts or biases.

What would settle it

An independent collection of Indic-language Text-to-SQL tasks created without the three-agent pipeline that shows no accuracy difference from English would falsify the reported gap.

Figures

Figures reproduced from arXiv: 2604.13686 by Aviral Dawar, Dhruv Kumar, Roshan Karanth, Vikram Goyal.

Figure 1: Cross-lingual EX on IndicDB. Telugu exhibits the most significant accuracy reduction relative to English.
Figure 3: Example of a generated multilingual task.
Figure 4: Execution accuracy (EA) with and without evidence file augmentation across languages. (Chart labels from an adjacent figure spilled into this caption: Group-By 21%, Schema Linking 20%, Condition / Filter 15%, Other 13%, Join 12%, Nested 11%, Aggregation 8%.)
Figure 6: Case studies illustrating lexical and structural errors across languages.
Figure 7: COMET-QE quality score distributions: (Left) aggregate distribution across the corpus, (Right) language-specific breakdown detailing the variance used for targeted human audit.
Figure 8: Impact of evidence files: (Left) distribution of improvements, (Right) execution accuracy gains across languages.
Figure 9: Schema diagram for a generated schema.
Original abstract

While Large Language Models (LLMs) have significantly advanced Text-to-SQL performance, existing benchmarks predominantly focus on Western contexts and simplified schemas, leaving a gap in real-world, non-Western applications. We present IndicDB, a multilingual Text-to-SQL benchmark for evaluating cross-lingual semantic parsing across diverse Indic languages. The relational schemas are sourced from open-data platforms, including the National Data and Analytics Platform (NDAP) and the India Data Portal (IDP), ensuring realistic administrative data complexity. IndicDB comprises 20 databases across 237 tables. To convert denormalized government data into rich relational structures, we employ an iterative three-agent framework (Architect, Auditor, Refiner) to ensure structural rigor and high relational density (11.85 tables per database; join depths up to six). Our pipeline is value-aware, difficulty-calibrated, and join-enforced, generating 15,617 tasks across English, Hindi, and five Indic languages. We evaluate cross-lingual semantic parsing performance of state-of-the-art models (DeepSeek v3.2, MiniMax 2.7, LLaMA 3.3, Qwen3) across seven linguistic variants. Results show a 9.00% performance drop from English to Indic languages, revealing an "Indic Gap" driven by harder schema linking, increased structural ambiguity, and limited external knowledge. IndicDB serves as a rigorous benchmark for multilingual Text-to-SQL. Code and data: https://anonymous.4open.science/r/multilingualText2Sql-Indic--DDCC/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces IndicDB, a multilingual Text-to-SQL benchmark with 15,617 tasks spanning English, Hindi, and five other Indic languages. It derives 20 realistic relational databases (237 tables total, avg. 11.85 tables/DB, joins to depth 6) from NDAP and IDP open data using an iterative three-agent LLM pipeline (Architect-Auditor-Refiner) described as value-aware, difficulty-calibrated, and join-enforced. Evaluation of models including DeepSeek v3.2, MiniMax 2.7, LLaMA 3.3, and Qwen3 reports a 9.00% average performance drop from English to Indic languages, interpreted as an 'Indic Gap' caused by harder schema linking, increased structural ambiguity, and limited external knowledge.

Significance. If the generated tasks are free of language-specific artifacts, IndicDB would be a valuable addition to Text-to-SQL resources by providing the first large-scale benchmark on real Indian administrative data in multiple Indic languages. The scale, relational complexity, and grounding in public government datasets distinguish it from prior synthetic or English-centric benchmarks and could usefully guide targeted improvements in cross-lingual semantic parsing.

major comments (2)
  1. The central claim of a 9.00% 'Indic Gap' is load-bearing on the assumption that the three-agent (Architect-Auditor-Refiner) pipeline produces tasks of equivalent difficulty and structure across languages. Because the agents are LLMs whose English performance substantially exceeds their Indic performance, systematic differences in query formulation, difficulty calibration, or join enforcement could artifactually inflate the measured gap. The manuscript provides no human validation, cross-lingual ablation, or quantitative comparison of task statistics (e.g., join depth distributions, schema-linking complexity) between English and Indic variants to rule out generation bias.
  2. The reported 9.00% performance drop is presented without error bars, confidence intervals, or statistical significance tests. The abstract and evaluation description also omit full per-model, per-language result tables or breakdowns, preventing assessment of whether the gap is consistent across models or driven by particular languages or query types.
minor comments (2)
  1. Clarify the exact set of seven linguistic variants evaluated (the abstract lists 'English, Hindi, and five Indic languages' but does not enumerate the five additional languages).
  2. The anonymous code/data link limits reproducibility assessment; consider providing a non-anonymous repository or detailed data-release plan in the camera-ready version.
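
Major comment 1 asks for a quantitative comparison of task statistics between language variants. One simple way to do that is sketched below, under loud assumptions: the number of JOIN keywords in the gold SQL is used as a crude proxy for join depth (the paper may define depth differently), the query lists are invented, and total variation distance is just one reasonable choice of distribution distance.

```python
import re
from collections import Counter

JOIN_RE = re.compile(r"\bJOIN\b", re.IGNORECASE)


def join_count(sql):
    """Proxy for join depth: the number of JOIN keywords in the query text."""
    return len(JOIN_RE.findall(sql))


def distribution(sqls):
    """Empirical distribution of join counts over a list of SQL strings."""
    counts = Counter(join_count(s) for s in sqls)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


# Invented gold queries standing in for two language variants of a task set.
q0 = "SELECT COUNT(*) FROM FACT_A"
q1 = "SELECT * FROM FACT_A a JOIN DIM_B b ON a.B_ID = b.ID"
q2 = "SELECT * FROM FACT_A a JOIN DIM_B b ON a.B_ID = b.ID JOIN DIM_C c ON a.C_ID = c.ID"

english_sql = [q0, q1, q1, q2]
hindi_sql = [q0, q1, q2, q2]

tv = total_variation(distribution(english_sql), distribution(hindi_sql))
print(f"total variation distance: {tv:.2f}")
```

A distance near zero for every language pair would support the claim that the variants are structurally matched; a large distance would suggest that filtering or generation changed the task mix between languages.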

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and commit to revisions that will strengthen the empirical support for our claims.

Point-by-point responses
  1. Referee: The central claim of a 9.00% 'Indic Gap' is load-bearing on the assumption that the three-agent (Architect-Auditor-Refiner) pipeline produces tasks of equivalent difficulty and structure across languages. Because the agents are LLMs whose English performance substantially exceeds their Indic performance, systematic differences in query formulation, difficulty calibration, or join enforcement could artifactually inflate the measured gap. The manuscript provides no human validation, cross-lingual ablation, or quantitative comparison of task statistics (e.g., join depth distributions, schema-linking complexity) between English and Indic variants to rule out generation bias.

    Authors: We acknowledge that an LLM-based generation pipeline carries a risk of language-dependent artifacts. The three-agent design attempts to mitigate this by using language-agnostic structural constraints (value-aware schema construction, validity auditing, and join-enforced refinement), but we agree this is insufficient without explicit verification. In the revised manuscript we will add: (i) side-by-side quantitative distributions of join depth, number of tables per query, predicate count, and schema-linking complexity metrics for English versus each Indic language; (ii) a cross-lingual ablation in which English-generated queries are machine-translated and re-run through the Auditor-Refiner stages for direct comparison; and (iii) a human validation study on a stratified sample of 300 tasks (50 per language) in which native speakers rate difficulty, naturalness, and structural equivalence. These additions will allow readers to assess whether the observed gap is attributable to model limitations rather than data generation bias. revision: yes

  2. Referee: The reported 9.00% performance drop is presented without error bars, confidence intervals, or statistical significance tests. The abstract and evaluation description also omit full per-model, per-language result tables or breakdowns, preventing assessment of whether the gap is consistent across models or driven by particular languages or query types.

    Authors: We agree that the current presentation lacks necessary statistical detail and granularity. The revised manuscript will include: error bars and 95% confidence intervals on all aggregate and per-language scores; paired statistical significance tests (Wilcoxon signed-rank and McNemar tests) for the English-to-Indic gap both overall and per model; and expanded evaluation tables reporting exact accuracy for every model-language pair together with breakdowns by query type (join depth 0-2, 3-4, 5-6; number of tables; presence of value linking). The abstract will be updated to note that detailed per-model and per-language results, including statistical tests, appear in the main text and appendix. revision: yes
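
The paired analysis promised in the second response can be sketched with the standard library alone. The example below is an illustration, not the authors' analysis code: the per-task outcomes are synthetic, the function names are hypothetical, and it assumes each task has a correct/incorrect result in both English and one Indic language. It computes an exact two-sided McNemar test on the discordant pairs and a percentile bootstrap confidence interval for the accuracy gap.

```python
import math
import random


def mcnemar_exact(english, indic):
    """Exact two-sided McNemar test on paired 0/1 outcomes; returns (b, c, p_value)."""
    b = sum(1 for e, i in zip(english, indic) if e and not i)  # English right, Indic wrong
    c = sum(1 for e, i in zip(english, indic) if i and not e)  # English wrong, Indic right
    n = b + c
    if n == 0:
        return b, c, 1.0
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


def bootstrap_gap_ci(english, indic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(english) - mean(indic), resampling tasks in pairs."""
    rng = random.Random(seed)
    n = len(english)
    gaps = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(english[j] - indic[j] for j in idx) / n)
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_resamples)]
    hi = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Synthetic outcomes for 100 paired tasks (not results from the paper):
# 70 both correct, 10 English-only correct, 2 Indic-only correct, 18 both wrong.
english = [1] * 70 + [1] * 10 + [0] * 2 + [0] * 18
indic = [1] * 70 + [0] * 10 + [1] * 2 + [0] * 18

b, c, p_value = mcnemar_exact(english, indic)
gap = (sum(english) - sum(indic)) / len(english)
ci_low, ci_high = bootstrap_gap_ci(english, indic)
print(f"b={b} c={c} p={p_value:.4f} gap={gap:.2f} CI=({ci_low:.2f}, {ci_high:.2f})")
```

Resampling task indices, rather than resampling each language independently, preserves the pairing between the English and Indic versions of a task, which is what makes the interval appropriate for a within-task gap.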

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction and external evaluation

full rationale

The paper constructs IndicDB by applying an iterative three-agent (Architect-Auditor-Refiner) pipeline to external NDAP and IDP government data sources, then evaluates off-the-shelf LLMs (DeepSeek, LLaMA, Qwen, etc.) on the resulting 15,617 tasks. The reported 9% English-to-Indic drop and 'Indic Gap' attribution are direct empirical measurements, not derived quantities. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the described chain; the methodology is self-contained against external models and real-world data without reducing the central claims to the generation process by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or new entities; the claim rests on assumptions about data fidelity and task realism from government sources.

pith-pipeline@v0.9.0 · 5591 in / 995 out tokens · 40011 ms · 2026-05-10T13:17:15.551106+00:00 · methodology

