Conversational Query Engine for Mixed-Modality Heterogeneous Enterprise Data Sources

Anindya Moitra; Ankur Vivek Singh; Darshita Rathore; Vaibhav Singal; Vineet Kumar

arxiv: 2606.28370 · v1 · pith:RYLGPTWGnew · submitted 2026-06-15 · 💻 cs.IR · cs.AI

Darshita Rathore, Vineet Kumar, Vaibhav Singal, Ankur Vivek Singh, Anindya Moitra

Pith reviewed 2026-06-30 11:15 UTC · model grok-4.3

keywords: conversational BI, natural language analytics, heterogeneous data sources, NL2SQL, RAG, query routing, enterprise analytics, query caching

The pith

COGNI unifies natural-language queries over structured warehouses and unstructured documents through four architectural layers in a production BI system.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces COGNI to solve the split between NL2SQL tools that ignore documents and RAG systems that ignore live tables by treating mixed-modality enterprise analytics as one query-processing task. It decomposes the problem into an indexing layer that adapts chunking to slide content, a routing layer that decides modality and complexity with a small fine-tuned model, a retrieval layer that runs self-correcting agents or recursive synthesis, and a caching layer that checks equivalence beyond embeddings. A sympathetic reader would care because this removes the need for users to pick the right backend and reports concrete gains in accuracy and speed on the authors' internal benchmark. The work shows how production constraints like cost and latency shape the choice of each component.

Core claim

COGNI treats natural-language analytics as a heterogeneous query processing problem organized as four architectural layers. First, an indexing layer implements slide-adaptive chunking for plain text, tables, charts, and key-value blocks. Second, a routing layer built on a LoRA fine-tuned Qwen-2.5-1.5B-Instruct model outputs a modality decision and a complexity assessment. Third, a retrieval layer runs a self-correcting NL2SQL agent and Recursive Language Models for multi-hop synthesis. Fourth, a caching layer validates query equivalence across multiple dimensions to avoid false hits.
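
As a rough illustration of how the router's dual output could select a downstream pipeline, the sketch below maps a (modality, complexity) pair to a pipeline name. The enum values, the `RouterDecision` type, and the pipeline names are all hypothetical; the claim above specifies only that the router emits a modality decision and a complexity assessment.

```python
from dataclasses import dataclass
from enum import Enum


class Modality(Enum):
    STRUCTURED = "structured"      # warehouse tables, served by NL2SQL
    UNSTRUCTURED = "unstructured"  # slide decks and other documents


class Complexity(Enum):
    SIMPLE = "simple"
    COMPLEX = "complex"


@dataclass(frozen=True)
class RouterDecision:
    """Hypothetical container for the router's two outputs."""
    modality: Modality
    complexity: Complexity


def select_pipeline(decision: RouterDecision) -> str:
    """Map the router's dual output to a (hypothetical) pipeline name."""
    if decision.modality is Modality.STRUCTURED:
        return "nl2sql_agent"
    if decision.complexity is Complexity.COMPLEX:
        return "recursive_synthesis"
    return "single_pass_retrieval"


choice = select_pipeline(RouterDecision(Modality.UNSTRUCTURED, Complexity.COMPLEX))
```

Keeping the decision as a small typed value makes the dispatch rule easy to audit separately from the model that produces it.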

What carries the argument

The four architectural layers that decompose heterogeneous query processing into indexing with slide-adaptive chunking, routing with a fine-tuned small model, retrieval with self-correcting agents, and caching with multi-dimensional equivalence validation.

If this is right

Users issue a single natural-language question without choosing between SQL and document tools.
Modality and complexity decisions run at 93.8 percent accuracy and roughly seven times lower cost than frontier models.
NL2SQL queries reach 93.9 percent G-Eval and multi-hop synthesis reaches 91.0 percent through self-correction.
Query caching delivers an 8.4 times latency reduction while producing zero false hits.
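
The self-correction mentioned in the list above can be pictured as a retry loop that feeds the failed SQL and the database error back to the generator. The following is a minimal sketch under stated assumptions: `sqlite3` stands in for the production warehouse, `toy_generator` stands in for the LLM, and the function names and the three-attempt limit are illustrative rather than taken from the paper.

```python
import sqlite3
from typing import Callable, Optional


def run_with_self_correction(
    conn: sqlite3.Connection,
    question: str,
    generate_sql: Callable[[str, Optional[str], Optional[str]], str],
    max_attempts: int = 3,
) -> list:
    """Execute generated SQL; on failure, pass the failed SQL and error back to the generator."""
    failed_sql: Optional[str] = None
    error: Optional[str] = None
    for _ in range(max_attempts):
        sql = generate_sql(question, failed_sql, error)
        try:
            return conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            failed_sql, error = sql, str(exc)
    raise RuntimeError(f"no valid SQL after {max_attempts} attempts: {error}")


def toy_generator(question: str, failed_sql: Optional[str], error: Optional[str]) -> str:
    # Stand-in for the LLM: the first guess uses a wrong column, the retry fixes it.
    if error is None:
        return "SELECT SUM(amount) FROM txns"
    return "SELECT SUM(tpv) FROM txns"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (tpv REAL)")
conn.executemany("INSERT INTO txns VALUES (?)", [(10.0,), (5.0,)])
rows = run_with_self_correction(conn, "What is total TPV?", toy_generator)
```

Bounding the number of attempts keeps worst-case latency and cost predictable, which matters when every retry is another model call.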

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same four-layer split could be tested on mixed data in legal discovery or clinical records where structured tables and free-text notes coexist.
Routing via a 1.5B fine-tuned model suggests that production systems can keep most decisions cheap while reserving larger models for final synthesis.
Multi-dimensional equivalence validation might improve cache safety in other retrieval systems that currently rely only on embedding similarity.
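
One way to read "multi-dimensional equivalence validation" is as a conjunction of checks: an embedding-similarity gate plus exact agreement on structured fields extracted from the query. The sketch below is only an assumption about what such dimensions might look like (`metric`, `time_range`, `filters`) with an arbitrary threshold; the dimensions the paper actually uses are not given in this summary.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class QuerySignature:
    """Hypothetical cache key: an embedding plus structured query fields."""
    embedding: tuple
    metric: str
    time_range: str
    filters: frozenset


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_cache_hit(new: QuerySignature, cached: QuerySignature, threshold: float = 0.95) -> bool:
    """A hit needs high embedding similarity AND exact agreement on every structured field."""
    return (
        cosine(new.embedding, cached.embedding) >= threshold
        and new.metric == cached.metric
        and new.time_range == cached.time_range
        and new.filters == cached.filters
    )


q1 = QuerySignature((1.0, 0.0), "tpv", "2025-Q1", frozenset({"region=EU"}))
q2 = QuerySignature((0.99, 0.05), "tpv", "2025-Q2", frozenset({"region=EU"}))  # different quarter
q3 = QuerySignature((0.99, 0.05), "tpv", "2025-Q1", frozenset({"region=EU"}))  # same intent
```

Here `q1` and `q2` are nearly identical in embedding space but ask about different quarters, so an embedding-only cache would return a stale answer while the conjunctive check rejects the hit.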

Load-bearing premise

The internal enterprise benchmark used for all accuracy numbers is representative of real user queries and the evaluation procedures introduce no post-hoc selection or labeling bias.

What would settle it

Running COGNI on a public mixed-modality query set and finding accuracy below the reported 88-93 percent range or any false-positive cache hits on non-equivalent queries would disprove the central performance claims.

Figures

Figures reproduced from arXiv: 2606.28370 by Anindya Moitra, Ankur Vivek Singh, Darshita Rathore, Vaibhav Singal, Vineet Kumar.

**Figure 1.** High-level architecture of COGNI.

3 Document Retrieval Pipeline. The document journey has two stages: an indexing stage that converts enterprise slide decks into a queryable knowledge base, and a retrieval stage that, for each query, selects from two strategies based on the router's complexity signal. Both stages operate over the same corpus and feed the same answer generator. The design choices below were…

**Figure 2.** COGNI query flow. A query enters the routing…

**Figure 3.** Routing accuracy on the fine-tuned model.

Cost profile. The fine-tuned model runs on a single T4 GPU at a fixed monthly cost of $360 ($0.50/hour × 720 hours), independent of request volume. GPT-5-Mini's per-token pricing scales linearly: ~$2.50/month at 10K requests, ~$2,500/month at 10M requests. At enterprise scale, the fine-tuned model is approximately 7× cheaper at higher accuracy. Single-pass inference produ…
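
The cost figures quoted in the Figure 3 entry can be checked with a few lines of arithmetic. This sketch assumes the per-request API cost is exactly linear, as the text states, and derives it from the 10K-request figure; the variable names are illustrative, and the calculation ignores whether a single T4 can actually sustain the request volume.

```python
GPU_HOURLY_USD = 0.50
HOURS_PER_MONTH = 720
# Derived from "~$2.50/month at 10K requests"; assumes strictly linear pricing.
API_COST_PER_REQUEST_USD = 2.50 / 10_000

# Fixed cost of the self-hosted router, independent of volume.
fixed_monthly = GPU_HOURLY_USD * HOURS_PER_MONTH


def api_monthly(requests: int) -> float:
    """Monthly API cost at a given request volume under linear pricing."""
    return API_COST_PER_REQUEST_USD * requests


# Volume above which the fixed-cost GPU becomes the cheaper option.
break_even_requests = fixed_monthly / API_COST_PER_REQUEST_USD
# Cost ratio at the 10M-request volume quoted in the text.
ratio_at_10m = api_monthly(10_000_000) / fixed_monthly

print(f"fixed: ${fixed_monthly:.0f}/month")
print(f"break-even: {break_even_requests:,.0f} requests/month")
print(f"API / GPU at 10M requests: {ratio_at_10m:.2f}x")
```

Under these assumptions the break-even sits at about 1.44 million requests per month, and the ratio at 10M requests is about 6.9, consistent with the "approximately 7×" claim. Below the break-even volume the per-token API is the cheaper option.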

Original abstract

Enterprise business intelligence queries span structured warehouses and unstructured document repositories, modalities with fundamentally different access methods, cost profiles, and correctness semantics. Existing AI-enabled interfaces force users to select the right tool: NL2SQL systems cannot reason over slide decks, and RAG pipelines lack access to live warehouse tables. We present COGNI, a production conversational BI system that treats natural-language analytics as a heterogeneous query processing problem, organized as four architectural layers. First, an indexing layer implements slide-adaptive chunking (recursive chunking for plain-text slides, hierarchical chunking for structured content such as tables, charts, and key-value blocks), achieving $88.3\%$ on our internal enterprise benchmark. Second, a routing layer built on a LoRA fine-tuned Qwen-2.5-1.5B-Instruct model produces a dual output, a modality decision and a complexity assessment, at $93.8\%$ accuracy and approximately $7\times$ lower cost than frontier models. Third, a retrieval layer executes complexity-adaptive pipelines: a self-correcting NL2SQL agent at $93.9\%$ G-Eval, and Recursive Language Models reaching $91.0\%$ on multi-hop synthesis queries. Finally, a caching layer validates query equivalence across multiple dimensions beyond embedding similarity, achieving zero false cache hits and $8.4\times$ latency reduction.
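
To make the indexing idea concrete, here is a simplified sketch of slide-adaptive chunking: plain text is split recursively on progressively finer separators, while tabular content is emitted row by row with its slide title and column headers attached. The `Slide` type, the separator list, the size limit, and the row-level interpretation of "hierarchical chunking" are all assumptions made for illustration and are not the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Slide:
    """Hypothetical slide record: free text plus tables given as rows of cells."""
    title: str
    text: str = ""
    tables: list = field(default_factory=list)  # table -> rows -> cells; first row is the header


def recursive_chunks(text: str, max_chars: int = 200, seps=("\n\n", "\n", ". ", " ")) -> list:
    """Split on the coarsest separator first, recursing to finer ones for oversized pieces."""
    if len(text) <= max_chars or not seps:
        return [text] if text.strip() else []
    chunks = []
    for piece in text.split(seps[0]):
        chunks.extend(recursive_chunks(piece, max_chars, seps[1:]))
    return chunks


def hierarchical_chunks(slide: Slide) -> list:
    """Emit one chunk per table row, prefixed with the slide title and column names."""
    chunks = []
    for table in slide.tables:
        if not table:
            continue
        header, *rows = table
        for row in rows:
            cells = ", ".join(f"{h}={c}" for h, c in zip(header, row))
            chunks.append(f"{slide.title} > {cells}")
    return chunks


def chunk_slide(slide: Slide) -> list:
    # Structured content is chunked hierarchically; any remaining text is chunked recursively.
    return hierarchical_chunks(slide) + recursive_chunks(slide.text)


table_slide = Slide("Q1 review", tables=[[["region", "tpv"], ["EU", "10"], ["US", "20"]]])
text_slide = Slide("Notes", text="Revenue grew. Costs fell.")
table_chunks = chunk_slide(table_slide)
text_chunks = chunk_slide(text_slide)
```

Attaching the title and headers to each row keeps a retrieved chunk interpretable on its own, which a purely length-based splitter would not guarantee for tables.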

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

COGNI describes a four-layer production system for mixed structured/unstructured enterprise queries, but every accuracy number rests on one internal benchmark with no construction details or external checks.

The paper's core contribution is COGNI, a named system that routes natural-language analytics queries across warehouses and document stores using four layers: adaptive indexing, LoRA-based routing plus complexity scoring, self-correcting retrieval, and multi-dimensional caching. The architecture directly tackles the practical split between NL2SQL and RAG tools that many enterprise teams face.

The engineering choices are reasonable. Slide-adaptive chunking handles tables, charts, and plain text differently. The small LoRA router aims for lower cost than frontier models. Recursive Language Models for multi-hop synthesis and equivalence-checked caching to cut latency are standard tactics applied in a combined stack. If the full paper shows working code or ablation results, those pieces could be worth borrowing.

The soft spot is straightforward: all headline figures (88.3% indexing, 93.8% routing, 93.9% G-Eval, 91% synthesis, zero false cache hits, 8.4× latency drop) come from a single undisclosed internal enterprise benchmark. No information appears on query sampling, labeling protocol, inter-annotator agreement, or how G-Eval and equivalence checks were validated. LLM-mediated scoring introduces circularity risk when the same model family may be involved in both generation and judgment. Without those details the numbers cannot be reproduced or stress-tested.

This is for teams building internal BI interfaces who want to see one concrete layering. It is not a foundational methods paper. The architecture itself is coherent and addresses a real friction point, but the evidence base is too thin for strong claims.

I would send it to peer review so referees can check the benchmark construction and any additional experiments in the full text. The work is worth a serious look once the measurement details are on the table.

Referee Report

1 major / 1 minor

Summary. The manuscript presents COGNI, a production conversational BI system that frames natural-language analytics over mixed structured warehouses and unstructured document repositories as a heterogeneous query processing problem. It describes a four-layer architecture (indexing with slide-adaptive chunking, routing via LoRA-tuned Qwen-2.5-1.5B, retrieval with self-correcting NL2SQL and Recursive Language Models, and multi-dimensional caching) and reports empirical results of 88.3% (indexing), 93.8% (routing), 93.9% G-Eval (NL2SQL), 91.0% (multi-hop synthesis), zero false cache hits, and 8.4× latency reduction, all measured on an internal enterprise benchmark.

Significance. If the results hold under external scrutiny, the work would be significant for enterprise IR by providing a unified production system that avoids forcing users to choose between NL2SQL and RAG pipelines. The explicit layering, cost-efficient LoRA routing, and equivalence-aware caching represent practical engineering contributions that could inform future heterogeneous retrieval systems.

major comments (1)

[Abstract] Every reported performance number (88.3%, 93.8%, 93.9% G-Eval, 91.0%, zero false cache hits, 8.4× latency) rests exclusively on a single undisclosed internal enterprise benchmark. No details are supplied on query collection method, modality/complexity distribution, ground-truth labeling protocol, inter-annotator agreement, or precise success criteria, rendering the central empirical claims impossible to reproduce or stress-test for selection or labeling bias.

minor comments (1)

[Abstract] The terms 'G-Eval' and 'Recursive Language Models' appear without prior definition or citation in the abstract.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their review and for recognizing the practical engineering contributions of the COGNI architecture. We address the single major comment below.

Referee: [Abstract] Every reported performance number (88.3%, 93.8%, 93.9% G-Eval, 91.0%, zero false cache hits, 8.4× latency) rests exclusively on a single undisclosed internal enterprise benchmark. No details are supplied on query collection method, modality/complexity distribution, ground-truth labeling protocol, inter-annotator agreement, or precise success criteria, rendering the central empirical claims impossible to reproduce or stress-test for selection or labeling bias.

Authors: We acknowledge the validity of this observation. All reported metrics derive from a single internal enterprise benchmark whose queries, modality distribution, labeling process, inter-annotator agreement, and exact success criteria cannot be disclosed. This restriction stems directly from confidentiality agreements with enterprise clients whose proprietary data populate the benchmark; releasing such information would violate those agreements. Consequently, external reproduction or independent bias auditing is not feasible. We do not view this as a flaw that can be remedied by revision, because the limitation is inherent to any production system paper that must protect client data. The metrics are presented to demonstrate real-world behavior rather than to support exact replication.

Revision: no

standing simulated objections not resolved

The query collection method, modality/complexity distribution, ground-truth labeling protocol, inter-annotator agreement, and precise success criteria for the internal enterprise benchmark remain undisclosed because of binding confidentiality agreements with enterprise clients.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper describes a four-layer architecture for COGNI and reports empirical performance numbers (88.3%, 93.8%, 93.9% G-Eval, 91.0%) measured on an internal benchmark. No equations, derivations, fitted parameters, or predictions appear that reduce by construction to inputs, self-citations, or ansatzes. All load-bearing claims are direct experimental results rather than quantities defined in terms of themselves, so the presentation is self-contained with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no mathematical derivations, free parameters, or new entities; all claims rest on the existence and representativeness of an internal benchmark whose construction details are not provided.

Appendix excerpts

Running header in the paper: Agent4IR '26 KDD, August 9th-13th, 2026, Jeju, Korea.

Appendix A. Production Model Configuration

| Component | Model |
| --- | --- |
| Retrieval embedding | text-embedding-3-large (3072d) |
| LLM reranking / answer synthesis | claude-sonnet-4-6 |
| Hierarchical extraction (vision) | gemini-2.5-flash |
| RLM root / sub-agent | GPT-5 / GP... |

...

- ONLY use column names from <db_mapping> tags in metadata
- DO NOT use measure names directly (e.g., "AOV", "ARPA") - these are labels, not columns
- For calculated metrics:
  - Check if there's a <formula> tag - use that exact formula
  - Otherwise, derive from basic columns (e.g., TPV/Txns for AOV)
- For column names with spaces or special characters:
  - Use backticks: `column name with spaces`

***CRITICAL REQUIREMENTS***

- Check table descriptions for mandatory filter requirements
- Pay attention to dimension definitions that mention "Always" or "must" requirements
- Do not assume or hardcode values for any columns
- Prefer subqueries or filtering from...

- First, identify what columns exist in <db_mapping> tags
- For aggregated metrics mentioned in user query:
  - Check if column exists in db_mapping
  - If not, check for <formula> in measures section
  - If no formula, calculate from base columns
- Always validate column names match exactly

Give only the SQL as output, no explanations.

B.2 NL2SQL Error Correction

Invoked when the previous SQL execution fails. Receives the failed SQL, the error message, and the sch...

- Use ONLY BigQuery Standard SQL (not legacy SQL)
- Use fully-qualified table names (project.dataset.table)
- Map user's natural language to exact column names using synonyms provided
- For derived metrics, use the formulas from metadata
- ALWAYS include a LIMIT clause (default: LIMIT 1000)

- Fix the specific error mentioned
- Ensure column names exactly match db_mapping values
- Use proper aggregation and GROUP BY clauses
- Return ONLY the corrected SQL query, no explanations

COMMON ERRORS AND FIXES:

- "Column X not found" -> Check db_mapping, use exact name
- "Type mismatch" -> Use CAST() for date comparisons
- "JOIN error" -> Verify JOIN keys exist in both tables
- "WITH clause" -> CTEs allowed, ensure final SELECT has LIMIT

Return ONLY the SQL query.

B.3 SQL G-Eval Rubric...

- QUERY UNDERSTANDING (Weight: 25%)
  - Does the answer directly address the user's question?
  - Is the scope correct (timeframe, geography, metrics)?
  - Is all information grounded in the data?
  - Is the detail level appropriate?
- TABLE SELECTION (Weight: 35%)
  - Score 9-10 if correct table, 7-8 if acceptable, lower if wrong table
- SQL RELEVANCE (Weight: 40%)
  - Are aggregations correct (SUM, AVG, COUNT)?
  - Are WHERE filters appropriate FOR THE QUESTION ASKED?
  - Is GROUP BY correct?
  - Do all columns exist in the table?
  - Is the SQL logic sound?

IMPORTANT: Only require WHERE filters that are NECESSARY to answer the specific question.

=== HARD-FAIL CAPS ===

- Wrong table: score <= 4
- ...

- Write a 2-3 sentence narrative summary answering the user's question
- List 2-4 key highlights as bullet points (trends, patterns, anomalies, or notable values)
- Focus on actionable insights
- Mention specific numbers and metrics
- Flag any data quality issues (nulls, zeros, unexpected values) if present

FORMAT:

Summary: [2-3 sentence overview]

Key Highlights:
- [First key finding with specific numbers]
- [Second key finding]
- [Additional findings as relevant]

Return ONLY the formatted insights
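
The rubric excerpt quoted above gives three criterion weights (25%, 35%, 40%) and a hard-fail cap for a wrong table (score <= 4). A minimal sketch of how such a rubric could be combined into a single score follows. It assumes each criterion is scored on a 0-10 scale and that the cap applies to the weighted total; both are guesses, since the excerpt is truncated and does not say how the final percentage is computed.

```python
# Weights as quoted in the rubric excerpt; key names are illustrative.
WEIGHTS = {
    "query_understanding": 0.25,
    "table_selection": 0.35,
    "sql_relevance": 0.40,
}
WRONG_TABLE_CAP = 4.0  # "Wrong table: score <= 4"


def rubric_score(scores: dict, wrong_table: bool = False) -> float:
    """Weighted mean of 0-10 criterion scores, capped when the wrong-table hard fail applies."""
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return min(total, WRONG_TABLE_CAP) if wrong_table else total


good = rubric_score({"query_understanding": 8, "table_selection": 9, "sql_relevance": 9})
capped = rubric_score(
    {"query_understanding": 9, "table_selection": 3, "sql_relevance": 8},
    wrong_table=True,
)
```

In this example the uncapped weighted total of the second query would be 6.5, so the cap is what prevents fluent SQL against the wrong table from earning a passing score.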

[1] [1]

Fu Bang et al . 2023. GPTCache: An Open-Source Semantic Cache for LLM Applications.arXiv preprint arXiv:2311.17174(2023)

work page arXiv 2023

[2] [2]

Dan Biderman, Jacob Portes, Jose Javier Gonzalez Ortiz, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, Cody Blakeney, and John Patrick Cunningham. 2024. LoRA Learns Less and Forgets Less.Transactions on Machine Learning Research(2024). https: //openreview.net/forum?id=aloEru2qCG

2024

[3] [3]

Cormack, Charles L

Gordon V. Cormack, Charles L. A. Clarke, and Stefan Büttcher. 2009. Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. InProceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). ACM, 758–759

2009

[4] [4]

Dawei Gao, Haibin Wang, Yaliang Li, Xiuyu Sun, Yichen Qian, Bolin Ding, and Jingren Zhou. 2024. DAIL-SQL: Optimized Few-Shot Text-to-SQL with Retrieval and LLMs.Proceedings of the VLDB Endowment17, 4 (2024), 950–962

2024

[5] [5]

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. 2024. Retrieval-Augmented Generation for Large Language Models: A Survey.arXiv preprint arXiv:2312.10997(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Gemma Team, Google DeepMind. 2025. Gemma 3 Technical Report. https: //ai.google.dev/gemma. Accessed 2026-04-15

2025

[7] [7]

Google Cloud. 2026. BigQuery: Serverless, highly scalable, and cost-effective data warehouse. https://cloud.google.com/bigquery Accessed: 2026-03-30

2026

[8] [8]

Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, Yuanzhuo Wang, and Jian Guo

[9] [9]

A Survey on LLM-as-a-Judge.arXiv preprint arXiv:2411.15594(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

Daniel Han, Michael Han, and Unsloth team. 2024. Unsloth: Faster, more memory- efficient LLM fine-tuning. https://github.com/unslothai/unsloth. GitHub reposi- tory, accessed 2026-04-15

2024

[11] [11]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. InInternational Conference on Learning Representations (ICLR). https://arxiv.org/abs/2106.09685

work page internal anchor Pith review Pith/arXiv arXiv 2022

[12] [12]

Kanishka, Chetna Bansal, and Anindya Moitra

Vineet Kumar, Ronald Tony, Darshita Rathore, Vipasha Rana, Bhuvanesh Man- dora, . Kanishka, Chetna Bansal, and Anindya Moitra. 2025. Genicious: Contextual Agent4IR’ 26 KDD, August 9th-13th, 2026, Jeju, Korea Darshita Rathore, Vineet Kumar, Vaibhav Singal, Ankur Vivek Singh, and Anindya Moitra Few-shot Prompting for Insights Discovery. InProceedings of the...

work page doi:10.1145/3703323.3704274 2025

[13] [13]

LangChain. 2026. LangGraph: Agent orchestration framework for reliable AI agents. https://www.langchain.com/langgraph Accessed: 2026-03-30

2026

[14] [14]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al

[15] [15]

In Advances in Neural Information Processing Systems (NeurIPS), Vol

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 33. 9459–9474

[16] [16]

Jinyang Li, Binyuan Hui, Ge Qu, Jiaxi Yang, Binhua Li, Bowen Li, Bailin Wang, Bowen Qin, Ruiying Geng, Nan Huo, et al. 2024. Can LLM Already Serve as a Database Interface? A BIRd Benchmark for Big Bench for Large-Scale Database Grounded Text-to-SQL Evaluation. InAdvances in Neural Information Processing Systems (NeurIPS), Vol. 36

2024

[17] [17]

Siran Li, Linus Stenzel, Carsten Eickhoff, and Seyed Ali Bahrainian. 2025. Enhanc- ing Retrieval-Augmented Generation: A Study of Best Practices. InProceedings of the 31st International Conference on Computational Linguistics (COLING). 6705– 6717

2025

[18] [18]

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023. G-Eval: NLG Evaluation using Gpt-4 with Better Human Alignment. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, 2511–2522. doi:...

work page doi:10.18653/v1/2023.emnlp- 2023

[19] [19]

Wensheng Lu, Keyu Chen, Ruizhi Qiao, and Xing Sun. 2025. Hichunk: Evaluating and enhancing retrieval-augmented generation with hierarchical chunking.arXiv preprint arXiv:2509.11552(2025)

work page arXiv 2025

[20] [20]

Meta AI (Facebook AI Research). 2026. Faiss: A library for efficient similarity search and clustering of dense vectors. https://faiss.ai/index.html Accessed: 2026-03-30

2026

[21] [21]

Microsoft. 2026. SharePoint Collaboration. https://www.microsoft.com/en- in/microsoft-365/sharepoint/collaboration. Accessed: 2026-03-28

2026

[22] [22]

Kakkar, Yu Gan, Amin Saberi, Fatma Ozcan, and Sercan Ö

Mohammadreza Pourreza, Hailong Li, Ruochen Sun, Yeounoh Chung, Shayan Talaei, Gaurav T. Kakkar, Yu Gan, Amin Saberi, Fatma Ozcan, and Sercan Ö. Arik

[23] [23]

InInternational Conference on Learning Representations (ICLR)

CHASE-SQL: Multi-Path Reasoning and Preference Optimized Candidate Selection in Text-to-SQL. InInternational Conference on Learning Representations (ICLR)

[24] [24]

Mohammadreza Pourreza and Davood Rafiei. 2024. DIN-SQL: Decomposed In- Context Learning of Text-to-SQL with Self-Correction. InAdvances in Neural Information Processing Systems (NeurIPS), Vol. 36

2024

[25] [25]

Darshita Rathore, Vineet Kumar, Chetna Bansal, and Anindya Moitra. 2025. How Much is Too Much? Exploring LoRA Rank Trade-offs for Retaining Knowledge and Domain Robustness. InProceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguist...

work page doi:10.18653/v1/2025.findings-ijcnlp.58 2025

[26] [26]

Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Frame- work: BM25 and Beyond.Foundations and Trends in Information Retrieval3, 4 (2009), 333–389

2009

[27] [27]

Lucas Seabra et al. 2024. Dynamic Multi-Agent Orchestration and Retrieval for Multi-Source Question-Answer Systems Using Large Language Models.arXiv preprint arXiv:2412.17944(2024)

[28]

Shayan Talaei, Mohammadreza Pourreza, Yu-Chen Chang, Azalia Mirhoseini, and Amin Saberi. 2024. CHESS: Contextual Harnessing for Efficient SQL Synthesis. arXiv preprint arXiv:2405.16755 (2024)

[29]

Qwen Team. 2024. Qwen2.5 Technical Report. arXiv:2412.15115 [cs.CL] https://arxiv.org/abs/2412.15115

[30]

VectifyAI. 2026. PageIndex: Document Index for Vectorless, Reasoning-based RAG. https://github.com/VectifyAI/PageIndex. GitHub repository, accessed 2026-03-28

[31]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR)

[32]

Tao Yu, Rui Zhang, Kai Yang, Michihiro Yasunaga, Dongxu Wang, Zifan Li, James Ma, Irene Li, Qingning Yao, Shanelle Roman, et al. 2018. Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP). 3911–3921

[33]

Alex L Zhang, Tim Kraska, and Omar Khattab. 2025. Recursive language models. arXiv preprint arXiv:2512.24601 (2025).

Appendix A Production Model Configuration

Component | Model
Retrieval Embedding | text-embedding-3-large (3072d)
LLM reranking / answer synthesis | claude-sonnet-4-6
Hierarchical extraction (vision) | gemini-2.5-flash
RLM root / sub-agent | GPT-5 / GP...

[34]

ONLY use column names from <db_mapping> tags in metadata

[35]

DO NOT use measure names directly (e.g., "AOV", "ARPA") - these are labels, not columns

[36]

For calculated metrics:
- Check if there's a <formula> tag - use that exact formula
- Otherwise, derive from basic columns (e.g., TPV/Txns for AOV)

[37]

For column names with spaces or special characters:
- Use backticks: `column name with spaces`

***CRITICAL REQUIREMENTS***
- Check table descriptions for mandatory filter requirements
- Pay attention to dimension definitions that mention "Always" or "must" requirements
- Do not assume or hardcode values for any columns
- Prefer subqueries or filtering from...

[38]

First, identify what columns exist in <db_mapping> tags

[39]

For aggregated metrics mentioned in user query:
- Check if column exists in db_mapping
- If not, check for <formula> in measures section
- If no formula, calculate from base columns

[40]

Always validate column names match exactly

Give only the SQL as output, no explanations.

Conversational Query Engine for Mixed-Modality Heterogeneous Enterprise Data Sources
Agent4IR ’26 KDD, August 9th-13th, 2026, Jeju, Korea

B.2 NL2SQL Error Correction

Invoked when the previous SQL execution fails. Receives the failed SQL, the error message, and the sch...

[41]

Use ONLY BigQuery Standard SQL (not legacy SQL)

[42]

Use fully-qualified table names (project.dataset.table)

[43]

Map user's natural language to exact column names using synonyms provided

[44]

For derived metrics, use the formulas from metadata

[45]

ALWAYS include a LIMIT clause (default: LIMIT 1000)

[46]

Fix the specific error mentioned

[47]

Ensure column names exactly match db_mapping values

[48]

Use proper aggregation and GROUP BY clauses

[49]

Return ONLY the corrected SQL query, no explanations

COMMON ERRORS AND FIXES:
- "Column X not found" -> Check db_mapping, use exact name
- "Type mismatch" -> Use CAST() for date comparisons
- "JOIN error" -> Verify JOIN keys exist in both tables
- "WITH clause" -> CTEs allowed, ensure final SELECT has LIMIT

Return ONLY the SQL query.

B.3 SQL G-Eval Rubric...

[50]

QUERY UNDERSTANDING (Weight: 25%)
- Does the answer directly address the user's question?
- Is the scope correct (timeframe, geography, metrics)?
- Is all information grounded in the data?
- Is the detail level appropriate?

[51]

TABLE SELECTION (Weight: 35%)
Score 9-10 if correct table, 7-8 if acceptable, lower if wrong table

[52]

SQL RELEVANCE (Weight: 40%)
- Are aggregations correct (SUM, AVG, COUNT)?
- Are WHERE filters appropriate FOR THE QUESTION ASKED?
- Is GROUP BY correct?
- Do all columns exist in the table?
- Is the SQL logic sound?

IMPORTANT: Only require WHERE filters that are NECESSARY to answer the specific question.

=== HARD-FAIL CAPS ===
- Wrong table: score <= 4
- ...
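The three rubric weights (25%, 35%, 40%) and the wrong-table cap can be combined into a single score. The following is a minimal sketch, assuming each criterion is scored on a 0-10 scale and the overall score is a plain weighted sum. The function name `geval_score`, the dictionary keys, and the way the cap is applied are illustrative assumptions, not the paper's implementation. Only the wrong-table cap is modeled, because the remaining caps are not visible in the rubric text above.

```python
# Rubric weights as listed in the B.3 rubric (25% / 35% / 40%).
WEIGHTS = {
    "query_understanding": 0.25,
    "table_selection": 0.35,
    "sql_relevance": 0.40,
}

# Hard-fail cap named in the rubric ("Wrong table: score <= 4").
# Other caps are truncated in the source and deliberately left out.
WRONG_TABLE_CAP = 4.0


def geval_score(scores: dict[str, float], wrong_table: bool = False) -> float:
    """Combine per-criterion scores (assumed 0-10) into one weighted score."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    if wrong_table:
        # A hard-fail cap bounds the final score regardless of other criteria.
        total = min(total, WRONG_TABLE_CAP)
    return round(total, 2)


if __name__ == "__main__":
    judged = {"query_understanding": 8, "table_selection": 10, "sql_relevance": 9}
    print(geval_score(judged))
    print(geval_score(judged, wrong_table=True))
```

Applying the cap after the weighted sum means a wrong table can never be offset by strong scores on the other two criteria, which appears to be the intent of a hard-fail rule.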

[53]

Write a 2-3 sentence narrative summary answering the user's question

[54]

List 2-4 key highlights as bullet points (trends, patterns, anomalies, or notable values)

[55]

Focus on actionable insights

[56]

Mention specific numbers and metrics

[57]

Flag any data quality issues (nulls, zeros, unexpected values) if present

FORMAT:
Summary: [2-3 sentence overview]
Key Highlights:
- [First key finding with specific numbers]
- [Second key finding]
- [Additional findings as relevant]

Return ONLY the formatted insights
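Because the prompt fixes an output format (a `Summary:` block followed by `Key Highlights:` bullets, with 2-4 highlights requested), a downstream consumer could check a response before displaying it. The sketch below is one hedged way to do that. The helper names `parse_insights` and `follows_format` are hypothetical, and the sample text is invented purely for illustration.

```python
import re


def parse_insights(text: str) -> dict:
    """Split a response into its summary and its highlight bullets."""
    match = re.search(
        r"Summary:\s*(?P<summary>.*?)\s*Key Highlights:\s*(?P<body>.*)",
        text,
        flags=re.DOTALL,
    )
    if match is None:
        raise ValueError("response does not follow the Summary / Key Highlights format")
    highlights = [
        line.strip()[1:].strip()  # drop the leading "-" of each bullet
        for line in match.group("body").splitlines()
        if line.strip().startswith("-")
    ]
    return {"summary": match.group("summary"), "highlights": highlights}


def follows_format(text: str) -> bool:
    """True when the response parses and lists 2-4 highlights, as the prompt asks."""
    try:
        parsed = parse_insights(text)
    except ValueError:
        return False
    return bool(parsed["summary"]) and 2 <= len(parsed["highlights"]) <= 4


if __name__ == "__main__":
    # Invented sample response, not taken from the paper.
    sample = (
        "Summary: Metric A rose during the period.\n"
        "Key Highlights:\n"
        "- Metric A increased\n"
        "- Metric B stayed flat\n"
    )
    print(parse_insights(sample))
    print(follows_format(sample))
```

A response that fails the check could be regenerated or shown with a fallback message; which of these a production system does is not stated in the source.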