SQuARE: Structured Query & Adaptive Retrieval Engine For Tabular Formats
Pith reviewed 2026-05-17 01:39 UTC · model grok-4.3
The pith
SQuARE scores each sheet's complexity from its header depth and merged-cell density, then routes questions to either structure-preserving chunk retrieval or SQL, with the aim of higher accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SQuARE is a hybrid retrieval framework with sheet-level, complexity-aware routing. It computes a continuous score based on header depth and merge density, then routes queries either through structure-preserving chunk retrieval or SQL over an automatically constructed relational representation. A lightweight agent supervises retrieval, refinement, or combination of results across both paths when confidence is low. This design maintains header hierarchies, time labels, and units, ensuring that returned values are faithful to the original cells and straightforward to verify.
What carries the argument
A continuous complexity score, computed from header depth and merge density, that routes queries between structure-preserving chunk retrieval and SQL over an automatically constructed relational representation.
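As a rough illustration of how such a score could drive routing, the sketch below combines header depth H and merge density M linearly, in the spirit of the X = αH + βM form quoted in the Lean theorems section of this page. The weights, the threshold, the SheetStats fields, and the routing direction (complex sheets to chunk retrieval, flat sheets to SQL) are all assumptions made for illustration, not values or rules taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class SheetStats:
    """Hypothetical per-sheet statistics; field names are illustrative."""
    header_depth: int   # H: number of stacked header rows
    merged_cells: int   # cells covered by merged ranges
    total_cells: int    # cells in the used range


def merge_density(stats: SheetStats) -> float:
    """M: fraction of used cells that belong to a merged range."""
    if stats.total_cells == 0:
        return 0.0
    return stats.merged_cells / stats.total_cells


def complexity_score(stats: SheetStats, alpha: float = 0.5, beta: float = 0.5) -> float:
    """X = alpha * H + beta * M, with assumed (not published) weights."""
    return alpha * stats.header_depth + beta * merge_density(stats)


def route(stats: SheetStats, threshold: float = 1.0) -> str:
    """Assumed rule: complex sheets go to chunk retrieval, flat sheets to SQL."""
    return "chunk" if complexity_score(stats) >= threshold else "sql"


flat_sheet = SheetStats(header_depth=1, merged_cells=0, total_cells=100)
merged_sheet = SheetStats(header_depth=3, merged_cells=40, total_cells=100)
print(route(flat_sheet), route(merged_sheet))
```

In practice the threshold would presumably be tuned on held-out data rather than fixed by hand, which is the derivation the referee asks for below.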
If this is right
- Returned values stay faithful to original cells with preserved header hierarchies and units for straightforward verification.
- The system surpasses single-strategy baselines and ChatGPT-4o on retrieval precision and answer accuracy across corporate balance sheets and heavily merged workbooks.
- Latency remains predictable regardless of table complexity.
- Retrieval is decoupled from the underlying model, allowing compatibility with future tabular foundation models.
Where Pith is reading between the lines
- The same header-and-merge scoring approach could be applied to other irregular tabular formats such as annotated CSV files.
- Perturbing the routing score on existing test tables would quantify how much accuracy depends on correct path selection.
- Combining SQuARE with larger multi-step agents might reduce the need for the lightweight supervisor on ambiguous queries.
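The perturbation idea in the second point can be sketched as follows. The record layout (one tuple per query holding the sheet's score and whether each path answered correctly), the Gaussian noise model, and the routing direction are assumptions for illustration; the example records are made-up placeholders, not results from the paper.

```python
import random


def routed_accuracy(records, threshold, noise=0.0, seed=0):
    """Accuracy when each query follows the path chosen by a (noisy) score.

    records: list of (score, chunk_correct, sql_correct) tuples.
    noise:   standard deviation of Gaussian noise added to each score.
    """
    rng = random.Random(seed)
    hits = 0
    for score, chunk_ok, sql_ok in records:
        jitter = rng.gauss(0.0, noise) if noise > 0 else 0.0
        # Assumed direction: scores at or above the threshold use the chunk path.
        hits += chunk_ok if score + jitter >= threshold else sql_ok
    return hits / len(records)


# Placeholder data only: two flat sheets where SQL wins, two complex ones where chunks win.
records = [(0.2, False, True), (0.4, False, True), (1.6, True, False), (1.9, True, False)]

clean = routed_accuracy(records, threshold=1.0)
always_chunk = routed_accuracy(records, threshold=float("-inf"))
for sigma in (0.0, 0.5, 1.0, 2.0):
    print(sigma, routed_accuracy(records, threshold=1.0, noise=sigma))
```

If accuracy falls quickly toward the always-chunk or always-SQL level as the noise grows, correct path selection is doing most of the work; a flat curve would suggest the two retrievers are largely interchangeable on that test set.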
Load-bearing premise
A continuous score computed from header depth and merge density reliably predicts which retrieval path will perform best on a given table.
What would settle it
A test set of multi-header tables where the score selects the worse-performing path and end-to-end accuracy drops below the stronger single-strategy baseline.
Original abstract
Accurate question answering over real spreadsheets remains difficult due to multirow headers, merged cells, and unit annotations that disrupt naive chunking, while rigid SQL views fail on files lacking consistent schemas. We present SQuARE, a hybrid retrieval framework with sheet-level, complexity-aware routing. It computes a continuous score based on header depth and merge density, then routes queries either through structure-preserving chunk retrieval or SQL over an automatically constructed relational representation. A lightweight agent supervises retrieval, refinement, or combination of results across both paths when confidence is low. This design maintains header hierarchies, time labels, and units, ensuring that returned values are faithful to the original cells and straightforward to verify. Evaluated on multi-header corporate balance sheets, a heavily merged World Bank workbook, and diverse public datasets, SQuARE consistently surpasses single-strategy baselines and ChatGPT-4o on both retrieval precision and end-to-end answer accuracy while keeping latency predictable. By decoupling retrieval from model choice, the system is compatible with emerging tabular foundation models and offers a practical bridge toward a more robust table understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SQuARE, a hybrid retrieval framework for question answering over complex tabular data including multi-header spreadsheets and merged cells. It computes a continuous complexity score from header depth and merge density to route each query to either structure-preserving chunk retrieval or an automatically generated SQL view, with a lightweight agent supervising or combining results on low-confidence cases. The system claims to maintain fidelity to original cell values and to outperform single-strategy baselines plus ChatGPT-4o on retrieval precision and end-to-end answer accuracy across corporate balance sheets, a heavily merged World Bank workbook, and diverse public datasets, while preserving predictable latency.
Significance. If the routing score is shown to correlate with path superiority, the approach could provide a practical, model-agnostic bridge between chunk-based and structured-query methods for heterogeneous tables. The explicit preservation of header hierarchies, time labels, and units is a clear strength, as is the decoupling from any particular foundation model.
major comments (2)
- [Routing and Agent Supervision] The central performance claim rests on the routing decision, yet the manuscript provides no correlation analysis, threshold derivation, or ablation that isolates the contribution of the header-depth/merge-density score from the two underlying retrievers. Without this, it is unclear whether the hybrid gains exceed what either path achieves alone.
- [Evaluation] Evaluation section: the abstract asserts consistent outperformance on precision and accuracy, but the provided description contains no quantitative metrics, error bars, dataset sizes, statistical tests, or exclusion criteria, preventing verification that the data support the central claim.
minor comments (2)
- [Method] Clarify the exact formula for the continuous complexity score and how the low-confidence threshold for agent intervention is set.
- [Experiments] Add a table comparing latency and accuracy across the three routing strategies (chunk-only, SQL-only, hybrid) on the same query sets.
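The comparison table requested in the second minor comment could be assembled from per-query logs along these lines. The log format (strategy name, correctness flag, latency in seconds) and the strategy labels are hypothetical, and the sample rows are placeholders rather than measurements.

```python
from collections import defaultdict
from statistics import mean


def summarize(runs):
    """Aggregate per-query logs into (strategy, n, accuracy, mean_latency) rows.

    runs: iterable of (strategy, correct, latency_seconds) tuples.
    """
    grouped = defaultdict(list)
    for strategy, correct, latency in runs:
        grouped[strategy].append((correct, latency))

    rows = []
    for strategy in sorted(grouped):
        items = grouped[strategy]
        accuracy = mean([1.0 if ok else 0.0 for ok, _ in items])
        latency = mean([lat for _, lat in items])
        rows.append((strategy, len(items), accuracy, latency))
    return rows


# Placeholder log entries; a real comparison would use the same query set for every strategy.
runs = [
    ("chunk-only", True, 1.0),
    ("chunk-only", False, 3.0),
    ("sql-only", True, 0.5),
    ("hybrid", True, 2.0),
]

print(f"{'strategy':<12}{'n':>4}{'accuracy':>10}{'latency_s':>11}")
for strategy, n, accuracy, latency in summarize(runs):
    print(f"{strategy:<12}{n:>4}{accuracy:>10.2f}{latency:>11.2f}")
```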
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the contributions of our routing mechanism and strengthen the empirical support for our claims. We address each major comment below and outline the revisions we will make.
- Referee: [Routing and Agent Supervision] The central performance claim rests on the routing decision, yet the manuscript provides no correlation analysis, threshold derivation, or ablation that isolates the contribution of the header-depth/merge-density score from the two underlying retrievers. Without this, it is unclear whether the hybrid gains exceed what either path achieves alone.
Authors: We agree that an explicit analysis of the routing score's contribution is necessary to substantiate the hybrid design. In the revised manuscript we will add a dedicated subsection under Evaluation that reports (i) Pearson and Spearman correlations between the continuous complexity score and the per-query performance delta between the chunk-based and SQL-based paths, (ii) the empirical derivation of the routing threshold from a held-out validation split, and (iii) an ablation that compares the full SQuARE system against two non-adaptive variants (always-chunk and always-SQL) on the same query sets. These additions will isolate the benefit attributable to the header-depth/merge-density routing from the strengths of the individual retrievers.
Revision: yes
- Referee: [Evaluation] Evaluation section: the abstract asserts consistent outperformance on precision and accuracy, but the provided description contains no quantitative metrics, error bars, dataset sizes, statistical tests, or exclusion criteria, preventing verification that the data support the central claim.
Authors: We acknowledge that the current Evaluation section relies on high-level statements without sufficient numerical detail. The revised version will expand this section to report exact precision@K and end-to-end accuracy figures for SQuARE, all single-strategy baselines, and ChatGPT-4o on each dataset (corporate balance sheets, World Bank workbook, and public benchmarks). We will include dataset cardinalities, standard deviations or error bars across multiple runs, paired statistical significance tests (e.g., McNemar or t-tests with p-values), and explicit exclusion criteria for any queries or sheets. These quantitative results will be presented in new tables and will directly support the abstract claims.
Revision: yes
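Two of the analyses promised in these responses can be sketched with the standard library alone: a Spearman correlation between the complexity score and the per-query performance delta, and an exact two-sided McNemar test on paired correctness. Function names and inputs are illustrative assumptions; the authors' actual analysis code is not shown on this page, and the numbers in the usage lines are placeholders.

```python
from math import comb


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value.

    b: queries only system A answered correctly.
    c: queries only system B answered correctly.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Placeholder inputs: complexity scores and (chunk minus SQL) accuracy deltas per query.
scores = [0.3, 0.8, 1.2, 1.9]
deltas = [-0.4, -0.1, 0.2, 0.5]
print(spearman(scores, deltas))
print(mcnemar_exact(b=0, c=5))
```

The exact binomial form of McNemar's test is used here because the number of discordant query pairs may be small; with many discordant pairs the chi-squared approximation would give similar p-values.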
Circularity Check
No circularity detected in routing score or system claims
The paper describes a hybrid retrieval framework that computes a continuous score from header depth and merge density to route queries to either structure-preserving chunk retrieval or SQL over an auto-generated view, with an agent for low-confidence cases. No equations, derivations, or fitted parameters are presented that reduce the routing decision or performance claims to inputs defined within the paper itself. The abstract and system description rely on external evaluation across corporate balance sheets, World Bank data, and public datasets, with comparisons to single-strategy baselines and ChatGPT-4o, rather than any self-referential fitting or self-citation load-bearing steps. The central mechanism is presented as an engineering choice supported by empirical results, not a derivation that collapses by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Spreadsheet tables contain header hierarchies, merged cells, and unit annotations that must be preserved for faithful answers.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "It computes a continuous score based on header depth and merge density, then routes queries either through structure-preserving chunk retrieval or SQL over an automatically constructed relational representation."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "X = αH + βM ... Class(W) = Multi-Header if H ≥ 2 or d ≥ ρ"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.