arxiv: 2512.04292 · v1 · submitted 2025-12-03 · 💻 cs.CL

SQuARE: Structured Query & Adaptive Retrieval Engine For Tabular Formats

Pith reviewed 2026-05-17 01:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords spreadsheet question answering · hybrid retrieval · table understanding · multi-header tables · merged cells · SQL generation · adaptive routing

The pith

SQuARE scores each sheet's complexity from header depth and merge density, then routes questions to either structure-preserving chunk retrieval or SQL, aiming for higher accuracy than either path alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SQuARE, a system for question answering over real spreadsheets that contain multi-row headers, merged cells, and unit annotations. It computes a continuous score from header depth and merge density to choose between retrieving chunks that keep the original structure and querying an automatically built relational view with SQL. A lightweight agent supervises or combines results when routing confidence is low. This matters because naive chunking loses the header hierarchy while rigid SQL fails on inconsistent schemas; the hybrid keeps returned values faithful to the source cells, so they are easy to verify. Evaluations on multi-header corporate balance sheets, a heavily merged World Bank workbook, and public datasets are reported to beat single-strategy baselines and ChatGPT-4o on retrieval precision and end-to-end accuracy while keeping latency predictable.

Core claim

SQuARE is a hybrid retrieval framework with sheet-level, complexity-aware routing. It computes a continuous score based on header depth and merge density, then routes queries either through structure-preserving chunk retrieval or SQL over an automatically constructed relational representation. A lightweight agent supervises retrieval, refinement, or combination of results across both paths when confidence is low. This design maintains header hierarchies, time labels, and units, ensuring that returned values are faithful to the original cells and straightforward to verify.
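
The summary above does not give the formula for the score or the routing threshold, so the following is only a minimal Python sketch of what complexity-aware routing could look like. The names `SheetStats`, `complexity_score`, and `route`, the equal weights, the header-row cap, the threshold, the margin, and the routing direction (complex sheets to chunk retrieval, flat sheets to SQL) are all assumptions made for illustration, not the authors' definitions.

```python
from dataclasses import dataclass


@dataclass
class SheetStats:
    header_rows: int   # number of stacked header rows
    merged_cells: int  # cells that belong to some merged range
    total_cells: int   # all cells in the used range


def complexity_score(stats, max_header_rows=4, w_depth=0.5, w_merge=0.5):
    """Hypothetical score in [0, 1]; weights and cap are illustrative only."""
    extra_rows = min(max(stats.header_rows - 1, 0), max_header_rows - 1)
    depth = extra_rows / (max_header_rows - 1)
    if stats.total_cells:
        merge_density = min(stats.merged_cells / stats.total_cells, 1.0)
    else:
        merge_density = 0.0
    return w_depth * depth + w_merge * merge_density


def route(score, threshold=0.4, margin=0.1):
    """Pick a path; scores close to the threshold go to the supervising agent."""
    if abs(score - threshold) < margin:
        return "agent"  # low confidence: supervise or combine both paths
    return "chunk" if score > threshold else "sql"


flat = SheetStats(header_rows=1, merged_cells=0, total_cells=200)
borderline = SheetStats(header_rows=3, merged_cells=60, total_cells=200)
nested = SheetStats(header_rows=4, merged_cells=100, total_cells=200)
for sheet in (flat, borderline, nested):
    score = complexity_score(sheet)
    print(round(score, 3), route(score))
```

Treating distance from the threshold as a confidence proxy is one simple design choice; a calibrated confidence estimate would be an equally plausible reading of the description.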

What carries the argument

The continuous complexity score from header depth and merge density that routes queries between structure-preserving chunk retrieval and automatic SQL representation.

If this is right

  • Returned values stay faithful to original cells with preserved header hierarchies and units for straightforward verification.
  • The system surpasses single-strategy baselines and ChatGPT-4o on retrieval precision and answer accuracy across corporate balance sheets and heavily merged workbooks.
  • Latency remains predictable regardless of table complexity.
  • Retrieval is decoupled from the underlying model, allowing compatibility with future tabular foundation models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same header-and-merge scoring approach could be applied to other irregular tabular formats such as annotated CSV files.
  • Perturbing the routing score on existing test tables would quantify how much accuracy depends on correct path selection.
  • Combining SQuARE with larger multi-step agents might reduce the need for the lightweight supervisor on ambiguous queries.
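
The perturbation experiment suggested in the second bullet could be sketched as follows: add noise to each table's routing score, re-route, and track how average accuracy changes. This is a hedged illustration only. The function `routed_accuracy`, the threshold, and every number in `tables` are synthetic placeholders, not results from the paper.

```python
import random


def routed_accuracy(tables, threshold=0.5, noise_sd=0.0, seed=0):
    """Mean accuracy when each table is routed by a noise-perturbed score.

    Each entry of `tables` is (score, accuracy_chunk_path, accuracy_sql_path).
    """
    rng = random.Random(seed)
    total = 0.0
    for score, acc_chunk, acc_sql in tables:
        noisy = score + rng.gauss(0.0, noise_sd)
        total += acc_chunk if noisy > threshold else acc_sql
    return total / len(tables)


# Synthetic placeholder data: (score, chunk accuracy, SQL accuracy).
tables = [
    (0.1, 0.60, 0.90),
    (0.2, 0.65, 0.88),
    (0.8, 0.85, 0.50),
    (0.9, 0.90, 0.40),
]

for sd in (0.0, 0.2, 0.5):
    print(sd, round(routed_accuracy(tables, noise_sd=sd), 3))
```

If accuracy falls quickly as `noise_sd` grows, path selection is doing real work; a flat curve would suggest the two retrievers are largely interchangeable on those tables.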

Load-bearing premise

A continuous score computed from header depth and merge density reliably predicts which retrieval path will perform best on a given table.

What would settle it

A test set of multi-header tables where the score selects the worse-performing path and end-to-end accuracy drops below the stronger single-strategy baseline.

Figures

Figures reproduced from arXiv: 2512.04292 by Chinmay Gondhalekar, Fang-Chun Yeh, Urjitkumar Patel.

Figure 1: Example balance sheet (Microsoft, FY2020–2024). Queries such as …
Figure 2: Accuracy by task and model. The router chooses chunk vs constrained …
Figure 3: LLM Invocation Budget per Query.
Adjacent text from the paper, captured with Figure 3 and truncated at both ends: "… flat tables, Chunk-only underperforms SQL-only, consistent with the advantage of deterministic filters and aggregates; the full system recovers the remaining gap via routing and merging." Section F, LLM Invocation Budget: "We distinguish between offline (indexing) and online (query time) LLM calls. Offline, chunk descriptions are generated once in a batched step. Online, per-query LLM …"
Original abstract

Accurate question answering over real spreadsheets remains difficult due to multirow headers, merged cells, and unit annotations that disrupt naive chunking, while rigid SQL views fail on files lacking consistent schemas. We present SQuARE, a hybrid retrieval framework with sheet-level, complexity-aware routing. It computes a continuous score based on header depth and merge density, then routes queries either through structure-preserving chunk retrieval or SQL over an automatically constructed relational representation. A lightweight agent supervises retrieval, refinement, or combination of results across both paths when confidence is low. This design maintains header hierarchies, time labels, and units, ensuring that returned values are faithful to the original cells and straightforward to verify. Evaluated on multi-header corporate balance sheets, a heavily merged World Bank workbook, and diverse public datasets, SQuARE consistently surpasses single-strategy baselines and ChatGPT-4o on both retrieval precision and end-to-end answer accuracy while keeping latency predictable. By decoupling retrieval from model choice, the system is compatible with emerging tabular foundation models and offers a practical bridge toward a more robust table understanding.
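
To make the SQL path concrete, here is a minimal sketch, assuming a two-row header whose merged cells have already been expanded, of unpivoting a sheet into a long-form table and querying it with SQLite. The table name `cells`, its columns, the helper `to_long_rows`, and all cell values are hypothetical; this summary does not describe how the paper actually builds its relational representation.

```python
import sqlite3

# Hypothetical sheet: two header rows, top-level merges already expanded.
header_rows = [
    ["", "Assets", "Assets", "Liabilities", "Liabilities"],
    ["Line item", "FY2023", "FY2024", "FY2023", "FY2024"],
]
body = [
    ["Current", 100.0, 120.0, 40.0, 45.0],
    ["Non-current", 300.0, 310.0, 90.0, 95.0],
]


def to_long_rows(header_rows, body):
    """Unpivot to (line_item, section, period, value), one row per data cell."""
    rows = []
    for record in body:
        for col in range(1, len(record)):
            rows.append(
                (record[0], header_rows[0][col], header_rows[1][col], record[col])
            )
    return rows


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cells (line_item TEXT, section TEXT, period TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO cells VALUES (?, ?, ?, ?)", to_long_rows(header_rows, body)
)

# Deterministic filter plus aggregate: total assets for FY2024.
total = conn.execute(
    "SELECT SUM(value) FROM cells WHERE section = ? AND period = ?",
    ("Assets", "FY2024"),
).fetchone()[0]
print(total)  # 430.0
```

Keeping the header levels as separate columns is what lets a filter such as `section = 'Assets'` stay exact instead of relying on text similarity.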

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SQuARE, a hybrid retrieval framework for question answering over complex tabular data including multi-header spreadsheets and merged cells. It computes a continuous complexity score from header depth and merge density to route each query to either structure-preserving chunk retrieval or an automatically generated SQL view, with a lightweight agent supervising or combining results on low-confidence cases. The system claims to maintain fidelity to original cell values and to outperform single-strategy baselines plus ChatGPT-4o on retrieval precision and end-to-end answer accuracy across corporate balance sheets, a heavily merged World Bank workbook, and diverse public datasets, while preserving predictable latency.

Significance. If the routing score is shown to correlate with path superiority, the approach could provide a practical, model-agnostic bridge between chunk-based and structured-query methods for heterogeneous tables. The explicit preservation of header hierarchies, time labels, and units is a clear strength, as is the decoupling from any particular foundation model.

major comments (2)
  1. [Routing and Agent Supervision] The central performance claim rests on the routing decision, yet the manuscript provides no correlation analysis, threshold derivation, or ablation that isolates the contribution of the header-depth/merge-density score from the two underlying retrievers. Without this, it is unclear whether the hybrid gains exceed what either path achieves alone.
  2. [Evaluation] Evaluation section: the abstract asserts consistent outperformance on precision and accuracy, but the provided description contains no quantitative metrics, error bars, dataset sizes, statistical tests, or exclusion criteria, preventing verification that the data support the central claim.
minor comments (2)
  1. [Method] Clarify the exact formula for the continuous complexity score and how the low-confidence threshold for agent intervention is set.
  2. [Experiments] Add a table comparing latency and accuracy across the three routing strategies (chunk-only, SQL-only, hybrid) on the same query sets.
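
The correlation analysis requested in major comment 1 could be sketched as below: correlate each table's complexity score with the accuracy gap between the chunk path and the SQL path. This is a standard-library illustration, not the authors' analysis. The helper names are hypothetical and the `scores` and `deltas` values are synthetic placeholders.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks, with ties receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


scores = [0.1, 0.3, 0.5, 0.7, 0.9]          # synthetic complexity scores
deltas = [-0.30, -0.10, 0.05, 0.20, 0.45]   # synthetic chunk-minus-SQL accuracy gaps
print(round(pearson(scores, deltas), 3), round(spearman(scores, deltas), 3))
```

A strongly positive correlation would support routing complex sheets to chunk retrieval; a value near zero would suggest the score carries little information about which path wins.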

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the contributions of our routing mechanism and strengthen the empirical support for our claims. We address each major comment below and outline the revisions we will make.

  1. Referee: [Routing and Agent Supervision] The central performance claim rests on the routing decision, yet the manuscript provides no correlation analysis, threshold derivation, or ablation that isolates the contribution of the header-depth/merge-density score from the two underlying retrievers. Without this, it is unclear whether the hybrid gains exceed what either path achieves alone.

    Authors: We agree that an explicit analysis of the routing score's contribution is necessary to substantiate the hybrid design. In the revised manuscript we will add a dedicated subsection under Evaluation that reports (i) Pearson and Spearman correlations between the continuous complexity score and the per-query performance delta between the chunk-based and SQL-based paths, (ii) the empirical derivation of the routing threshold from a held-out validation split, and (iii) an ablation that compares the full SQuARE system against two non-adaptive variants (always-chunk and always-SQL) on the same query sets. These additions will isolate the benefit attributable to the header-depth/merge-density routing from the strengths of the individual retrievers. revision: yes

  2. Referee: [Evaluation] Evaluation section: the abstract asserts consistent outperformance on precision and accuracy, but the provided description contains no quantitative metrics, error bars, dataset sizes, statistical tests, or exclusion criteria, preventing verification that the data support the central claim.

    Authors: We acknowledge that the current Evaluation section relies on high-level statements without sufficient numerical detail. The revised version will expand this section to report exact precision@K and end-to-end accuracy figures for SQuARE, all single-strategy baselines, and ChatGPT-4o on each dataset (corporate balance sheets, World Bank workbook, and public benchmarks). We will include dataset cardinalities, standard deviations or error bars across multiple runs, paired statistical significance tests (e.g., McNemar or t-tests with p-values), and explicit exclusion criteria for any queries or sheets. These quantitative results will be presented in new tables and will directly support the abstract claims. revision: yes
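
As one example of the paired significance tests mentioned in the second response, an exact McNemar test on per-query correctness could look like the sketch below. It uses only the standard library. The function names are hypothetical and the correctness lists are synthetic placeholders, not data from the paper.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count queries where exactly one of the two systems is correct."""
    b = sum(1 for a, other in zip(correct_a, correct_b) if a and not other)
    c = sum(1 for a, other in zip(correct_a, correct_b) if not a and other)
    return b, c


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant counts b and c.

    Under the null hypothesis the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Synthetic per-query correctness for two systems on the same eight queries.
system_correct = [True, True, True, False, True, True, True, True]
baseline_correct = [True, False, False, False, True, False, True, False]

b, c = discordant_counts(system_correct, baseline_correct)
p_value = mcnemar_exact(b, c)
print(b, c, p_value)  # 4 0 0.125
```

The test only looks at queries where the two systems disagree, which is why it suits paired comparisons on a shared query set.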

Circularity Check

0 steps flagged

No circularity detected in routing score or system claims


The paper describes a hybrid retrieval framework that computes a continuous score from header depth and merge density to route queries to either structure-preserving chunk retrieval or SQL over an auto-generated view, with an agent for low-confidence cases. No equations, derivations, or fitted parameters are presented that reduce the routing decision or performance claims to inputs defined within the paper itself. The abstract and system description rely on external evaluation across corporate balance sheets, World Bank data, and public datasets, with comparisons to single-strategy baselines and ChatGPT-4o, rather than any self-referential fitting or self-citation load-bearing steps. The central mechanism is presented as an engineering choice supported by empirical results, not a derivation that collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The design rests on domain assumptions about spreadsheet structure rather than new mathematical axioms or invented physical entities. No free parameters are explicitly fitted in the abstract description.

axioms (1)
  • domain assumption: Spreadsheet tables contain header hierarchies, merged cells, and unit annotations that must be preserved for faithful answers.
    Invoked to justify the structure-preserving path and the need for complexity-aware routing.
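
What preserving the header hierarchy can mean in practice is sketched below, under two loudly stated assumptions: merged header cells appear as a value followed by `None` placeholders, and merges run horizontally only. The helpers `forward_fill`, `header_paths`, and `row_to_chunk`, the separator, and the sample values are all hypothetical and are not taken from the paper.

```python
def forward_fill(row):
    """Propagate the last seen header value across merged (None) cells."""
    filled, last = [], None
    for cell in row:
        if cell is not None:
            last = cell
        filled.append(last)
    return filled


def header_paths(header_rows, sep=" > "):
    """Join stacked header rows into one path per column."""
    filled = [forward_fill(row) for row in header_rows]
    paths = []
    for column in zip(*filled):
        parts = [str(part) for part in column if part not in (None, "")]
        paths.append(sep.join(parts))
    return paths


def row_to_chunk(label, values, paths):
    """Render one data row as text that keeps the header path and unit."""
    return "; ".join(f"{label} | {path} = {value}" for path, value in zip(paths, values))


# Hypothetical two-row header with horizontally merged top-level cells.
header_rows = [
    ["Assets (USD millions)", None, "Liabilities (USD millions)", None],
    ["FY2023", "FY2024", "FY2023", "FY2024"],
]
paths = header_paths(header_rows)
chunk = row_to_chunk("Current", [100, 120, 40, 45], paths)
print(paths)
print(chunk)
```

Forward-filling would mislabel a header cell that is genuinely empty rather than merged, so a real pipeline would more likely read the merge ranges from the workbook itself.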

pith-pipeline@v0.9.0 · 5490 in / 1253 out tokens · 50328 ms · 2026-05-17T01:39:37.541395+00:00 · methodology
