MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering

Jaeyoung Do; Jusang Oh; Sieun Hyeon; Sunghwan Steve Cho

arxiv: 2602.09642 · v2 · submitted 2026-02-10 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-16 05:34 UTC · model grok-4.3

keywords Table Question Answering · Multi-Agent Framework · Large Language Models · Small Language Models · Reasoning Paths · Efficiency Optimization · Answer Selection

The pith

MATA achieves state-of-the-art table question answering accuracy using diverse reasoning paths and small language model tools while minimizing large model calls.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MATA as a multi-agent framework for table question answering that creates multiple candidate answers through different reasoning styles. These candidates are then refined or selected using tools constructed from small language models. An efficiency algorithm reduces the number of calls to large language models. The framework delivers strong results on benchmarks of varying difficulty using ten different LLMs, including small open-source ones. This setup supports reliable and flexible performance in settings where resources or privacy are concerns.

Core claim

MATA generates candidate answers through diverse reasoning styles for a given table and question, then refines or selects the optimal answer with the help of tools built from small language models. It incorporates an algorithm designed to minimize expensive LLM agent calls, allowing it to maintain strong performance with small, open-source models and adapt across various LLM types.

What carries the argument

Multi-agent orchestration of complementary reasoning paths that produce answer candidates, combined with small language model tools for refinement or selection and an algorithm that limits LLM calls.
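
The mechanism above can be made concrete with a small sketch. The abstract does not say how MATA's call-minimizing algorithm works, so everything here is an assumption for illustration: reasoning paths are modeled as plain callables, the small-LM tool is a hypothetical `selector` function, and the early-exit rule (stop once `agree` paths return the same normalized answer) is just one plausible way to skip calls, not the authors' method.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical interfaces, not taken from the MATA code base.
ReasoningPath = Callable[[str, str], str]       # (table, question) -> answer
Selector = Callable[[str, str, List[str]], str]  # stand-in for a small-LM tool


def normalize(answer: str) -> str:
    """Cheap canonical form so trivially different strings count as equal."""
    return answer.strip().lower()


def answer_question(
    table: str,
    question: str,
    paths: Sequence[ReasoningPath],
    selector: Selector,
    agree: int = 2,
) -> Tuple[str, int]:
    """Run paths one at a time and stop early once `agree` of them match.

    Returns the chosen answer and the number of path calls actually made.
    """
    if not paths:
        raise ValueError("at least one reasoning path is required")

    candidates: List[str] = []
    for calls, path in enumerate(paths, start=1):
        candidates.append(normalize(path(table, question)))
        top_answer, votes = Counter(candidates).most_common(1)[0]
        if votes >= agree:
            # Enough paths agree: skip the remaining (expensive) calls.
            return top_answer, calls

    # No agreement: defer to the (assumed) small-LM selection tool.
    return selector(table, question, candidates), len(paths)
```

Under these assumptions the expensive calls shrink whenever the first few paths agree, and the selector is only consulted on contested questions. That is exactly where the reliability of the small-model tool matters most.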

If this is right

MATA achieves state-of-the-art accuracy on two table QA benchmarks.
It enables highly efficient reasoning by avoiding excessive LLM inference.
The framework works well with small open-source models.
It adapts easily to different types of LLMs.
Orchestration of multiple reasoning pathways supports scalable and reliable TableQA.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Orchestrating diverse paths may help overcome limitations of any single reasoning style in structured data tasks.
Small model tools could potentially be applied to other selection problems in LLM pipelines.
Testing on dynamic or multi-hop table questions might reveal further strengths or limits of the approach.
The efficiency gains suggest viability for deployment in privacy-sensitive environments.

Load-bearing premise

Tools built from small language models can reliably refine or select optimal answers from candidates generated by diverse reasoning paths across varied table-question pairs.

What would settle it

A test set of table-question pairs where the small language model tools select incorrect candidates at a rate significantly higher than the best single reasoning path.
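
One way to run that check is sketched below. The record layout (`gold`, `candidates`, `selected`) is a hypothetical logging format, not something the paper defines. The report compares the selector's accuracy with the best single path and with an oracle that picks a correct candidate whenever one exists.

```python
from typing import Dict, List, TypedDict


class Record(TypedDict):
    gold: str                    # reference answer
    candidates: Dict[str, str]   # path name -> candidate answer
    selected: str                # answer chosen by the small-LM tool


def selection_report(records: List[Record]) -> Dict[str, object]:
    """Compare selector accuracy with each single path and with an oracle."""
    if not records:
        raise ValueError("records must not be empty")

    n = len(records)
    path_names = sorted(records[0]["candidates"])

    # Accuracy of each reasoning path taken on its own.
    path_acc = {
        name: sum(r["candidates"][name] == r["gold"] for r in records) / n
        for name in path_names
    }
    best_path = max(path_acc, key=path_acc.get)

    # Accuracy of the answer the selection tool actually picked.
    selector_acc = sum(r["selected"] == r["gold"] for r in records) / n

    # Upper bound: a perfect selector succeeds whenever any candidate is right.
    oracle_acc = sum(r["gold"] in r["candidates"].values() for r in records) / n

    return {
        "path_accuracy": path_acc,
        "best_path": best_path,
        "selector_accuracy": selector_acc,
        "oracle_accuracy": oracle_acc,
        # Negative gap means the selector does worse than the best single path.
        "selector_minus_best_path": selector_acc - path_acc[best_path],
    }
```

A clearly negative `selector_minus_best_path` on a held-out set would count against the load-bearing premise. A large distance between `selector_accuracy` and `oracle_accuracy` would show how much headroom the selection step leaves.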

Original abstract

Recent advances in Large Language Models (LLMs) have significantly improved table understanding tasks such as Table Question Answering (TableQA), yet challenges remain in ensuring reliability, scalability, and efficiency, especially in resource-constrained or privacy-sensitive environments. In this paper, we introduce MATA, a multi-agent TableQA framework that leverages multiple complementary reasoning paths and a set of tools built with small language models. MATA generates candidate answers through diverse reasoning styles for a given table and question, then refines or selects the optimal answer with the help of these tools. Furthermore, it incorporates an algorithm designed to minimize expensive LLM agent calls, enhancing overall efficiency. MATA maintains strong performance with small, open-source models and adapts easily across various LLM types. Extensive experiments on two benchmarks of varying difficulty with ten different LLMs demonstrate that MATA achieves state-of-the-art accuracy and highly efficient reasoning while avoiding excessive LLM inference. Our results highlight that careful orchestration of multiple reasoning pathways yields scalable and reliable TableQA. The code is available at https://github.com/AIDASLab/MATA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MATA adds multi-agent paths plus small-model tools for TableQA and claims SOTA with fewer big-LLM calls, but the tools' actual reliability is not shown.

MATA runs several reasoning paths on a table and question, then hands the candidates to tools built from small language models that refine or pick the final answer, plus a scheduling step that cuts down on expensive LLM calls. It reports better accuracy than prior work on two benchmarks while working with ten different LLMs, including small open-source ones, and the code is released. That combination of diverse paths and lightweight tools is the main new piece. It is useful to see the same framework hold up across model sizes and to have the implementation out for others to try. The experiments cover varying difficulty levels, which is a plus for practical claims.

The soft spot is the missing evidence on the small-model tools themselves. The abstract gives no numbers on how often those tools pick correctly, no error analysis on ambiguous cases, and no ablation that compares the full system against simple majority vote from the paths. Without those checks it is hard to tell whether the efficiency and accuracy gains come from the orchestration or from the tools happening to work on these datasets. If the small models add noise on harder tables, both the SOTA and the low-call story rest on an assumption that is not yet tested in the reported results.

This is for people building TableQA systems who care about cost and reliability in real deployments. A reader already working on agent orchestration or structured-data reasoning would pick up concrete implementation ideas and see how one setup scales across models. It is worth sending to peer review so the experiments can be examined in full; the idea is clear enough that referees can judge whether the missing ablations are easy to add or point to a deeper gap.

Referee Report

2 major / 1 minor

Summary. The paper introduces MATA, a multi-agent framework for Table Question Answering that generates candidate answers through diverse reasoning paths and employs tools built from small language models to refine or select the optimal answer, while using an algorithm to minimize expensive LLM calls. It reports state-of-the-art accuracy on two benchmarks of varying difficulty using ten different LLMs, with strong performance maintained even when using small open-source models, and provides code at https://github.com/AIDASLab/MATA.

Significance. If the empirical claims hold under proper verification, MATA would represent a meaningful advance in reliable and efficient TableQA by showing that careful orchestration of multiple reasoning pathways with small-LM tools can deliver high accuracy without excessive LLM inference, particularly in resource-constrained settings. The open-source code release supports reproducibility and further experimentation.

major comments (2)

[Abstract] Abstract: the central claim that small-LM tools reliably refine or select optimal answers from diverse reasoning paths is load-bearing for both the SOTA accuracy and efficiency results, yet no quantitative evaluation of tool accuracy, error rates, or ablation (with vs. without tools) is provided; this leaves the weakest assumption untested across the reported table-question pairs.
[Experiments] Experiments section (implied by abstract claims): the SOTA results on two benchmarks with ten LLMs are reported without details on exact baselines, error bars, data splits, or statistical significance tests, making it impossible to verify whether the gains are robust or merely incremental over majority voting.

minor comments (1)

[Abstract] Abstract: the phrase 'highly efficient reasoning' is used without defining the metric (e.g., number of LLM calls saved or wall-clock time) relative to prior multi-agent baselines.
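
To illustrate the kind of definition the minor comment asks for, here is a minimal sketch of one candidate efficiency metric: mean LLM calls per question and the fraction saved relative to a baseline that always spends a fixed number of calls. The function name and the choice of baseline are assumptions made for this example, and the paper may define efficiency differently.

```python
from typing import Dict, Sequence


def call_savings(calls_per_question: Sequence[int], baseline_calls: int) -> Dict[str, float]:
    """Summarize LLM-call usage against a fixed-cost baseline.

    calls_per_question: LLM calls actually made for each question.
    baseline_calls: calls a baseline would spend on every question
                    (for example, always running every reasoning path).
    """
    if not calls_per_question:
        raise ValueError("calls_per_question must not be empty")
    if baseline_calls <= 0:
        raise ValueError("baseline_calls must be positive")

    mean_calls = sum(calls_per_question) / len(calls_per_question)
    return {
        "mean_calls": mean_calls,
        # Share of baseline calls avoided; 0.0 means no saving.
        "saved_fraction": 1 - mean_calls / baseline_calls,
    }
```

Reporting both numbers alongside accuracy would let readers compare the efficiency claim with prior multi-agent baselines on the same footing.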

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper accordingly to improve clarity and strengthen the empirical support for our claims.

Referee: [Abstract] Abstract: the central claim that small-LM tools reliably refine or select optimal answers from diverse reasoning paths is load-bearing for both the SOTA accuracy and efficiency results, yet no quantitative evaluation of tool accuracy, error rates, or ablation (with vs. without tools) is provided; this leaves the weakest assumption untested across the reported table-question pairs.

Authors: We acknowledge that a dedicated quantitative evaluation of the small-LM tools would provide stronger evidence for their contribution. The manuscript reports end-to-end results across ten LLMs and two benchmarks, but does not isolate tool-level accuracy or error rates. In the revised version, we will add an ablation study (with vs. without tools), report tool accuracy metrics on the table-question pairs, and include error analysis to directly test this assumption.

revision: yes

Referee: [Experiments] Experiments section (implied by abstract claims): the SOTA results on two benchmarks with ten LLMs are reported without details on exact baselines, error bars, data splits, or statistical significance tests, making it impossible to verify whether the gains are robust or merely incremental over majority voting.

Authors: We agree that additional experimental details are necessary for verification. The current experiments section compares MATA against standard baselines including majority voting, but lacks explicit error bars, precise data split descriptions, and significance testing. We will revise the section to list all baselines explicitly, report standard deviations from multiple runs, specify the exact splits used, and add statistical significance tests (e.g., paired t-tests) against the majority-voting baseline to confirm robustness.

revision: yes
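
For readers who want to see what the promised paired test involves, the sketch below computes a paired t statistic from per-run accuracies of two systems evaluated under the same seeds or splits. It uses only the standard library and stops at the statistic and degrees of freedom; a p-value would come from a t-distribution (for instance via `scipy.stats.ttest_rel`). The function name and input layout are assumptions for illustration, not the authors' evaluation code.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t_statistic(
    system_acc: Sequence[float],
    baseline_acc: Sequence[float],
) -> Tuple[float, int]:
    """Paired t statistic for matched runs (same seed/split per index).

    Returns (t, degrees_of_freedom).
    """
    if len(system_acc) != len(baseline_acc):
        raise ValueError("both systems need the same number of runs")
    if len(system_acc) < 2:
        raise ValueError("at least two paired runs are required")

    # Work on per-run differences so run-to-run noise shared by both cancels.
    diffs = [s - b for s, b in zip(system_acc, baseline_acc)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")

    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

Pairing by run matters here: comparing the two systems on identical seeds and splits removes variance that both share, which an unpaired comparison of mean accuracies would leave in.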

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

The paper introduces the MATA multi-agent framework and reports its performance via experiments on two external benchmarks across ten LLMs. No mathematical derivations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations appear in the methodology or results. Claims of SOTA accuracy and efficiency are grounded in reported empirical outcomes rather than reducing to the framework's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to identify specific free parameters, axioms, or invented entities; framework appears to rely on standard LLM capabilities and empirical tuning.
