arxiv: 2512.04292 · v1 · submitted 2025-12-03 · 💻 cs.CL

SQuARE: Structured Query & Adaptive Retrieval Engine For Tabular Formats

Chinmay Gondhalekar, Urjitkumar Patel, Fang-Chun Yeh

Pith reviewed 2026-05-17 01:39 UTC · model grok-4.3

classification 💻 cs.CL

keywords: spreadsheet question answering, hybrid retrieval, table understanding, multi-header tables, merged cells, SQL generation, adaptive routing

The pith

SQuARE scores each sheet's complexity from header depth and merge density, then routes questions to either structure-preserving chunk retrieval or SQL, aiming for higher accuracy than either path alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SQuARE, a system for question answering over real spreadsheets that contain multi-row headers, merged cells, and unit annotations. It computes a continuous score from header depth and merge density to choose between retrieving chunks that keep the original structure and running SQL over an automatically built relational view. A lightweight agent steps in to supervise or combine results when routing confidence is low. This matters because naive chunking loses hierarchy while rigid SQL fails on inconsistent schemas; the hybrid keeps values faithful to source cells so they are easy to verify. Evaluations on corporate balance sheets, a heavily merged World Bank workbook, and public datasets are reported to beat single-strategy baselines and ChatGPT-4o on retrieval precision and end-to-end accuracy while keeping latency predictable.

Core claim

SQuARE is a hybrid retrieval framework with sheet-level, complexity-aware routing. It computes a continuous score based on header depth and merge density, then routes queries either through structure-preserving chunk retrieval or SQL over an automatically constructed relational representation. A lightweight agent supervises retrieval, refinement, or combination of results across both paths when confidence is low. This design maintains header hierarchies, time labels, and units, ensuring that returned values are faithful to the original cells and straightforward to verify.
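To make the routing idea concrete, here is a minimal sketch, not the authors' implementation. It assumes the score is a weighted sum of header depth and merge density, that higher scores favor the chunk path, and that "low confidence" means the score falls close to the cut. The names `SheetStats`, `complexity_score`, and `route`, along with the weights, threshold, and margin, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SheetStats:
    header_depth: int      # H: number of stacked header rows (assumed definition)
    merge_density: float   # M: fraction of cells inside merged ranges, in [0, 1]


def complexity_score(stats: SheetStats, alpha: float = 0.5, beta: float = 2.0) -> float:
    """Weighted sum alpha * H + beta * M; the weights are illustrative only."""
    return alpha * stats.header_depth + beta * stats.merge_density


def route(stats: SheetStats, threshold: float = 1.0, margin: float = 0.25) -> str:
    """Return 'chunk', 'sql', or 'agent' for a sheet.

    Scores within `margin` of `threshold` are treated as low confidence
    and handed to the supervising agent (an assumed rule).
    """
    score = complexity_score(stats)
    if abs(score - threshold) < margin:
        return "agent"
    return "chunk" if score > threshold else "sql"


if __name__ == "__main__":
    print(route(SheetStats(header_depth=1, merge_density=0.0)))   # flat sheet
    print(route(SheetStats(header_depth=3, merge_density=0.4)))   # multi-header sheet
    print(route(SheetStats(header_depth=2, merge_density=0.05)))  # near the cut
```

Keeping the score and the routing rule in separate functions means the cut and the margin can be tuned without touching how the sheet statistics are measured.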

What carries the argument

The continuous complexity score from header depth and merge density that routes queries between structure-preserving chunk retrieval and automatic SQL representation.

If this is right

Returned values stay faithful to original cells with preserved header hierarchies and units for straightforward verification.
The system surpasses single-strategy baselines and ChatGPT-4o on retrieval precision and answer accuracy across corporate balance sheets and heavily merged workbooks.
Latency remains predictable regardless of table complexity.
Retrieval is decoupled from the underlying model, allowing compatibility with future tabular foundation models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same header-and-merge scoring approach could be applied to other irregular tabular formats such as annotated CSV files.
Perturbing the routing score on existing test tables would quantify how much accuracy depends on correct path selection.
Combining SQuARE with larger multi-step agents might reduce the need for the lightweight supervisor on ambiguous queries.
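
The perturbation idea in the second item could be prototyped along these lines. This is a sketch under stated assumptions: the records are made-up toy data, not results from the paper, and `routed_accuracy`, the threshold, and the Gaussian noise model are hypothetical choices.

```python
import random

# Hypothetical per-query records: (complexity score, chunk path correct?, SQL path correct?)
TOY_RECORDS = [
    (0.4, False, True),
    (0.6, False, True),
    (0.9, True, True),
    (1.2, True, False),
    (1.8, True, False),
    (2.3, True, False),
]


def routed_accuracy(records, threshold=1.0, noise_std=0.0, seed=0):
    """Accuracy when each query follows the path picked by a noisy score."""
    rng = random.Random(seed)
    hits = 0
    for score, chunk_ok, sql_ok in records:
        noisy = score + rng.gauss(0.0, noise_std)
        # Assumed rule: scores above the threshold go to the chunk path.
        hits += chunk_ok if noisy > threshold else sql_ok
    return hits / len(records)


if __name__ == "__main__":
    for std in (0.0, 0.25, 0.5, 1.0):
        print(f"noise_std={std:.2f} accuracy={routed_accuracy(TOY_RECORDS, noise_std=std):.2f}")
```

If accuracy falls quickly as the noise grows, path selection is doing real work; a flat curve would suggest the two paths are largely interchangeable on that test set.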

Load-bearing premise

A continuous score computed from header depth and merge density reliably predicts which retrieval path will perform best on a given table.

What would settle it

A test set of multi-header tables where the score selects the worse-performing path and end-to-end accuracy drops below the stronger single-strategy baseline.

Figures

Figures reproduced from arXiv: 2512.04292 by Chinmay Gondhalekar, Fang-Chun Yeh, Urjitkumar Patel.

**Figure 2.** Accuracy by task and model. The router chooses chunk vs constrained …

**Figure 3.** LLM Invocation Budget per Query.

… flat tables, Chunk-only underperforms SQL-only, consistent with the advantage of deterministic filters and aggregates; the full system recovers the remaining gap via routing and merging. F. LLM Invocation Budget: We distinguish between offline (indexing) and online (query time) LLM calls. Offline, chunk descriptions are generated once in a batched step. Online, per-query LLM …

Original abstract

Accurate question answering over real spreadsheets remains difficult due to multirow headers, merged cells, and unit annotations that disrupt naive chunking, while rigid SQL views fail on files lacking consistent schemas. We present SQuARE, a hybrid retrieval framework with sheet-level, complexity-aware routing. It computes a continuous score based on header depth and merge density, then routes queries either through structure-preserving chunk retrieval or SQL over an automatically constructed relational representation. A lightweight agent supervises retrieval, refinement, or combination of results across both paths when confidence is low. This design maintains header hierarchies, time labels, and units, ensuring that returned values are faithful to the original cells and straightforward to verify. Evaluated on multi-header corporate balance sheets, a heavily merged World Bank workbook, and diverse public datasets, SQuARE consistently surpasses single-strategy baselines and ChatGPT-4o on both retrieval precision and end-to-end answer accuracy while keeping latency predictable. By decoupling retrieval from model choice, the system is compatible with emerging tabular foundation models and offers a practical bridge toward a more robust table understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SQuARE's complexity-based routing between chunk retrieval and SQL is a reasonable engineering idea for messy spreadsheets, but the abstract gives no evidence that the routing score actually improves results over just running both paths.

The paper describes a hybrid system that scores each sheet by header depth and merge density, then routes a query to either structure-preserving chunk retrieval or an automatically built SQL view, with a lightweight agent stepping in on low-confidence cases. The goal is to keep header hierarchies, units, and time labels intact so answers stay verifiable against the original cells. This setup targets real corporate balance sheets and merged public workbooks where plain chunking and rigid SQL both break down.

The integration of sheet-level scoring with dual paths plus agent oversight is the concrete new piece; it is not just another restatement of prior hybrid retrieval work. The practical framing around decoupling retrieval from the underlying model is also sensible and could let the approach plug into newer tabular foundation models.

The main gap is that nothing in the abstract shows the complexity score correlates with which path actually performs better on a given sheet. Without correlation checks, threshold analysis, or an ablation that isolates the routing decision, it is unclear whether the routing itself adds value or whether the gains come from running both methods and letting the agent combine them. The outperformance claims over single-strategy baselines and ChatGPT-4o are stated but come with no numbers, dataset sizes, or error details here, so the central performance story cannot be checked yet.

This is the kind of applied system paper that practitioners building data-analysis tools might want to read for the framework and implementation ideas. It is not aimed at readers seeking new theoretical results on table understanding. I would send it to peer review so the authors can supply the missing routing validation and full evaluation numbers.

Referee Report

2 major / 2 minor

Summary. The paper introduces SQuARE, a hybrid retrieval framework for question answering over complex tabular data including multi-header spreadsheets and merged cells. It computes a continuous complexity score from header depth and merge density to route each query to either structure-preserving chunk retrieval or an automatically generated SQL view, with a lightweight agent supervising or combining results on low-confidence cases. The system claims to maintain fidelity to original cell values and to outperform single-strategy baselines plus ChatGPT-4o on retrieval precision and end-to-end answer accuracy across corporate balance sheets, a heavily merged World Bank workbook, and diverse public datasets, while preserving predictable latency.

Significance. If the routing score is shown to correlate with path superiority, the approach could provide a practical, model-agnostic bridge between chunk-based and structured-query methods for heterogeneous tables. The explicit preservation of header hierarchies, time labels, and units is a clear strength, as is the decoupling from any particular foundation model.

major comments (2)

[Routing and Agent Supervision] The central performance claim rests on the routing decision, yet the manuscript provides no correlation analysis, threshold derivation, or ablation that isolates the contribution of the header-depth/merge-density score from the two underlying retrievers. Without this, it is unclear whether the hybrid gains exceed what either path achieves alone.
[Evaluation] Evaluation section: the abstract asserts consistent outperformance on precision and accuracy, but the provided description contains no quantitative metrics, error bars, dataset sizes, statistical tests, or exclusion criteria, preventing verification that the data support the central claim.

minor comments (2)

[Method] Clarify the exact formula for the continuous complexity score and how the low-confidence threshold for agent intervention is set.
[Experiments] Add a table comparing latency and accuracy across the three routing strategies (chunk-only, SQL-only, hybrid) on the same query sets.
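
The comparison requested in the last minor comment could be assembled from per-query logs roughly as follows. This is a sketch with made-up toy rows, not figures from the paper; the log layout and the `summarize` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical log rows: (strategy, query id, answer correct?, latency in seconds)
TOY_LOG = [
    ("chunk-only", "q1", True, 1.9), ("chunk-only", "q2", False, 2.1),
    ("sql-only", "q1", False, 1.2), ("sql-only", "q2", True, 1.4),
    ("hybrid", "q1", True, 2.0), ("hybrid", "q2", True, 1.6),
]


def summarize(log):
    """Return {strategy: (accuracy, mean latency)}, checking query sets match."""
    by_strategy = defaultdict(list)
    for strategy, qid, correct, latency in log:
        by_strategy[strategy].append((qid, correct, latency))
    # All strategies must be scored on the same queries for a fair table.
    query_sets = {frozenset(q for q, _, _ in rows) for rows in by_strategy.values()}
    if len(query_sets) != 1:
        raise ValueError("strategies were evaluated on different query sets")
    return {
        name: (sum(c for _, c, _ in rows) / len(rows), mean(t for _, _, t in rows))
        for name, rows in by_strategy.items()
    }


if __name__ == "__main__":
    for name, (acc, lat) in summarize(TOY_LOG).items():
        print(f"{name:<12} accuracy={acc:.2f} mean_latency={lat:.2f}s")
```

The query-set check enforces the "same query sets" condition in the comment, so a strategy cannot look better simply because it was scored on easier questions.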

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the contributions of our routing mechanism and strengthen the empirical support for our claims. We address each major comment below and outline the revisions we will make.

Referee: [Routing and Agent Supervision] The central performance claim rests on the routing decision, yet the manuscript provides no correlation analysis, threshold derivation, or ablation that isolates the contribution of the header-depth/merge-density score from the two underlying retrievers. Without this, it is unclear whether the hybrid gains exceed what either path achieves alone.

Authors: We agree that an explicit analysis of the routing score's contribution is necessary to substantiate the hybrid design. In the revised manuscript we will add a dedicated subsection under Evaluation that reports (i) Pearson and Spearman correlations between the continuous complexity score and the per-query performance delta between the chunk-based and SQL-based paths, (ii) the empirical derivation of the routing threshold from a held-out validation split, and (iii) an ablation that compares the full SQuARE system against two non-adaptive variants (always-chunk and always-SQL) on the same query sets. These additions will isolate the benefit attributable to the header-depth/merge-density routing from the strengths of the individual retrievers.
Revision: yes
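
Items (i) and (ii) of this response might look like the following in practice. This is a dependency-free sketch, assuming per-query correctness flags for both paths are available on a validation split and that scores above the cut go to the chunk path; `pearson`, `average_ranks`, `spearman`, and `best_threshold` are hypothetical helper names.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (needs nonzero variance)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation, computed as Pearson on average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def best_threshold(scores, chunk_ok, sql_ok):
    """Cut that maximizes validation accuracy when scores above it go to chunk."""
    candidates = sorted(set(scores))
    cuts = [candidates[0] - 1.0] + candidates  # the first cut sends everything to chunk

    def accuracy(cut):
        picks = (c if s > cut else q for s, c, q in zip(scores, chunk_ok, sql_ok))
        return sum(picks) / len(scores)

    return max(cuts, key=accuracy)
```

The per-query delta for the correlations would be chunk correctness minus SQL correctness (values in -1, 0, 1). Because that delta is heavily tied, the rank-based coefficient is likely the more informative of the two.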
Referee: [Evaluation] Evaluation section: the abstract asserts consistent outperformance on precision and accuracy, but the provided description contains no quantitative metrics, error bars, dataset sizes, statistical tests, or exclusion criteria, preventing verification that the data support the central claim.

Authors: We acknowledge that the current Evaluation section relies on high-level statements without sufficient numerical detail. The revised version will expand this section to report exact precision@K and end-to-end accuracy figures for SQuARE, all single-strategy baselines, and ChatGPT-4o on each dataset (corporate balance sheets, World Bank workbook, and public benchmarks). We will include dataset cardinalities, standard deviations or error bars across multiple runs, paired statistical significance tests (e.g., McNemar or t-tests with p-values), and explicit exclusion criteria for any queries or sheets. These quantitative results will be presented in new tables and will directly support the abstract claims.
Revision: yes
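
For the paired significance test mentioned in this response, one option is the exact (binomial) form of McNemar's test on per-query correctness flags. The sketch below assumes both systems were run on the same queries in the same order; `mcnemar_exact` is a hypothetical helper name and the example flags are made up.

```python
from math import comb


def mcnemar_exact(system_ok, baseline_ok):
    """Two-sided exact McNemar p-value from paired per-query correctness flags."""
    b = sum(1 for s, o in zip(system_ok, baseline_ok) if s and not o)  # only system right
    c = sum(1 for s, o in zip(system_ok, baseline_ok) if o and not s)  # only baseline right
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    # Under the null, discordant pairs split 50/50: Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    system = [True] * 5 + [True] * 3      # made-up flags
    baseline = [False] * 5 + [True] * 3
    print(mcnemar_exact(system, baseline))  # 0.0625
```

Only the discordant pairs enter the statistic, which is why five wins out of five disagreements still gives p = 0.0625: small query sets limit how strong a claim the test can support.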

Circularity Check

0 steps flagged

No circularity detected in routing score or system claims

The paper describes a hybrid retrieval framework that computes a continuous score from header depth and merge density to route queries to either structure-preserving chunk retrieval or SQL over an auto-generated view, with an agent for low-confidence cases. No equations, derivations, or fitted parameters are presented that reduce the routing decision or performance claims to inputs defined within the paper itself. The abstract and system description rely on external evaluation across corporate balance sheets, World Bank data, and public datasets, with comparisons to single-strategy baselines and ChatGPT-4o, rather than any self-referential fitting or self-citation load-bearing steps. The central mechanism is presented as an engineering choice supported by empirical results, not a derivation that collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The design rests on domain assumptions about spreadsheet structure rather than new mathematical axioms or invented physical entities. No free parameters are explicitly fitted in the abstract description.

axioms (1)

Domain assumption: Spreadsheet tables contain header hierarchies, merged cells, and unit annotations that must be preserved for faithful answers.
Invoked to justify the structure-preserving path and the need for complexity-aware routing.

pith-pipeline@v0.9.0 · 5490 in / 1253 out tokens · 50328 ms · 2026-05-17T01:39:37.541395+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

It computes a continuous score based on header depth and merge density, then routes queries either through structure-preserving chunk retrieval or SQL over an automatically constructed relational representation.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

X = αH + βM ... Class(W) = Multi-Header if H ≥ 2 or d ≥ ρ

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
