TabReX : Tabular Referenceless eXplainable Evaluation

Aparna Garimella; Junha Park; Tejas Anvekar; Vivek Gupta

arxiv: 2512.15907 · v2 · submitted 2025-12-17 · 💻 cs.CL


Pith reviewed 2026-05-16 21:22 UTC · model grok-4.3

classification 💻 cs.CL

keywords tabular evaluation · LLM metrics · reference-less evaluation · knowledge graphs · structured generation · explainable evaluation · benchmark


The pith

TabReX evaluates LLM-generated tables without references by turning text and tables into knowledge graphs for fidelity scoring.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Evaluating tables produced by large language models is difficult because existing metrics either discard structure by flattening tables to text or require fixed reference tables that limit use across domains. TabReX offers a reference-less alternative that converts both the source text and the generated table into canonical knowledge graphs, aligns the graphs through an LLM-guided process, and derives rubric-aware scores for structural and factual fidelity. The scores include cell-level error traces and allow users to adjust sensitivity versus specificity. The authors test the approach on TabReX-Bench, a dataset covering six domains and twelve perturbation types, where TabReX shows higher agreement with expert rankings than prior metrics and greater stability under harder perturbations.

Core claim

TabReX converts source text and generated tables into canonical knowledge graphs, aligns them via LLM-guided matching, and computes interpretable rubric-aware scores that quantify structural and factual fidelity, delivering human-aligned judgments along with cell-level error traces and controllable trade-offs between sensitivity and specificity.

What carries the argument

Conversion of text and tables to canonical knowledge graphs followed by LLM-guided alignment to produce rubric-aware fidelity scores.
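The table side of this conversion can be illustrated with a small deterministic sketch. The paper performs Text2Graph and Table2Graph with LLM prompts; the code below is only an assumed stand-in that treats the first column of a markdown table as the subject and each remaining header as a normalized predicate, producing `[subject, predicate, object]` triples. The function names `normalize_key` and `table_to_triples` are hypothetical, and the missing-value cell in the example is invented for illustration.

```python
import re


def normalize_key(text: str) -> str:
    """Lowercase a header and join its words with underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.strip().lower()).strip("_")


def table_to_triples(markdown_table: str) -> list[tuple[str, str, str]]:
    """Turn a simple markdown table into (subject, predicate, object) triples.

    Assumption: the first column names the entity each row is about.
    """
    rows = []
    for line in markdown_table.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Skip the header separator row, e.g. |---|:---:|
        if all(re.fullmatch(r":?-+:?", c) for c in cells):
            continue
        rows.append(cells)
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[1:]
    triples = []
    for row in body:
        subject = row[0].lower()
        for col_name, value in zip(header[1:], row[1:]):
            # Empty cells become "-" to mark missing data.
            triples.append((subject, normalize_key(col_name), value or "-"))
    return triples


example_table = """
| Group | Unemployment Rate | Participation Rate |
|---|---|---|
| Non-Aboriginal population | 8.2% | |
"""
example_triples = table_to_triples(example_table)
for triple in example_triples:
    print(triple)
```

A rule-based converter like this cannot canonicalize paraphrased headers or entities, which is the gap the LLM-guided steps in the paper are meant to close.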

Load-bearing premise

LLM-guided matching of the resulting knowledge graphs will accurately capture structural and factual fidelity without alignment errors or model biases.

What would settle it

A collection of generated tables in which experts identify clear factual or structural errors that the graph matching process misses, producing high TabReX scores that contradict the expert rankings.
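One way to make such a check concrete is a rank correlation between metric scores and expert judgments over the same tables. The sketch below is a plain-Python Spearman correlation with average ranks for ties; it is an illustration rather than the paper's evaluation code, the function names are hypothetical, and the scores are invented. A strongly negative or near-zero value on tables with expert-identified errors would be the kind of contradiction described above.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1 (smallest) upward, averaging ranks over ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(x: list[float], y: list[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented data: metric scores and expert quality ratings for four tables.
metric_scores = [0.9, 0.8, 0.4, 0.2]
expert_agrees = [4, 3, 2, 1]      # experts order the tables the same way
expert_disagrees = [1, 2, 3, 4]   # experts order the tables the opposite way

rho_agree = spearman(metric_scores, expert_agrees)
rho_disagree = spearman(metric_scores, expert_disagrees)
print(rho_agree, rho_disagree)
```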

Figures

Figures reproduced from arXiv: 2512.15907 by Aparna Garimella, Junha Park, Tejas Anvekar, Vivek Gupta.

**Figure 1.** Metric movements across difficulty levels. Arrows show each metric's shift from easy (blue) to hard (red) perturbations. Axes plot specificity (y) vs. sensitivity (x), with the green region denoting the balanced ideal zone. The dashed diagonal marks the optimal trade-off. TABREX stays near this zone, maintaining the right direction even for hard examples.

**Figure 2.** Illustration of the proposed TABREX. Both source text and generated tables are converted into knowledge graphs via Text2Graph and Table2Graph, aligned through an LLM-guided Graph Alignment, and finally scored by a Property-Driven Scoring function that aggregates alignment statistics into interpretable, controllable table- and cell-level penalties.

**Figure 3.** Perturbation landscape across difficulty and type. The radial stacked donut visualizes the distribution of perturbation types segmented by difficulty: Easy (green), Medium (blue), and Hard (red). The top and bottom semicircles correspond to data-altering and data-preserving transformations, respectively.

**Figure 4.** Rubric-wise alignment across models and prompting strategies. Top row: cell-level agreement within model across prompts. Bottom row: table-level agreement. Model size and reasoning style influence local precision more than structural coherence, while prompt strategy (like Map&Make (Ahuja et al., 2025)) drives balanced alignment across rubric dimensions.

read the original abstract

Evaluating the quality of tables generated by large language models (LLMs) remains an open challenge: existing metrics either flatten tables into text, ignoring structure, or rely on fixed references that limit generalization. We present TabReX, a reference-less, property-driven framework for evaluating tabular generation via graph-based reasoning. TabReX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity. The resulting metric provides controllable trade-offs between sensitivity and specificity, yielding human-aligned judgments and cell-level error traces. To systematically asses metric robustness, we introduce TabReX-Bench, a large-scale benchmark spanning six domains and twelve planner-driven perturbation types across three difficulty tiers. Empirical results show that TabReX achieves the highest correlation with expert rankings, remains stable under harder perturbations, and enables fine-grained model-vs-prompt analysis establishing a new paradigm for trustworthy, explainable evaluation of structured generation systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TabReX turns tables and text into graphs for reference-free scoring with an LLM matcher, but the alignment step has no reported human validation.


TabReX converts source text and generated tables into canonical knowledge graphs, aligns them via LLM-guided matching, and produces rubric-based scores for structure and facts. The authors also release TabReX-Bench, a benchmark with six domains and twelve perturbation types at three difficulty levels.

The benchmark construction and the cell-level error traces are the clearest contributions. Those traces give a practical way to see exactly where a table fails, which is more actionable than a single correlation number. The controllable trade-off between sensitivity and specificity through the rubrics is also a reasonable design choice for different applications.

The soft spot is the matching step itself. The abstract claims highest expert correlation and stability under harder perturbations, yet gives no numbers on how often the LLM matcher gets the alignments right against human judgment, no ablation on matcher choice, and no separate test for prompt sensitivity. Without that, the reported correlations could be partly driven by the matcher favoring certain generation styles or simply making consistent errors. The circularity risk is real if the evaluator LLM overlaps with the models under test.

This is for researchers who build or use metrics for structured LLM outputs in data pipelines or reporting. A reader focused on evaluation methodology would find the benchmark useful even if the metric needs tightening. The paper deserves peer review because the problem is concrete and the graph-plus-rubric framing is distinct enough from flattening or reference-based baselines to get useful referee input on the alignment validation.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces TabReX, a reference-less, property-driven framework for evaluating LLM-generated tables. It converts both source text and generated tables into canonical knowledge graphs, aligns them via an LLM-guided matching process, and computes interpretable rubric-aware scores quantifying structural and factual fidelity. The paper also presents TabReX-Bench, spanning six domains and twelve planner-driven perturbation types across three difficulty tiers, and reports that TabReX achieves the highest correlation with expert rankings while remaining stable under harder perturbations, enabling fine-grained model-vs-prompt analysis.

Significance. If the central empirical claims hold after addressing the noted gaps, TabReX could establish a new paradigm for trustworthy, explainable evaluation of structured generation systems. It directly tackles limitations of existing metrics that either ignore table structure or require fixed references, and the introduction of a large-scale perturbation benchmark is a positive contribution that could support more robust future evaluations.

major comments (3)

[Abstract] Abstract: The claim that TabReX achieves the highest correlation with expert rankings and remains stable under harder perturbations is presented without any details on the exact scoring formulas, graph construction rules, statistical tests (e.g., Pearson/Spearman coefficients, p-values), or confidence intervals. This absence makes it impossible to assess whether the data support the headline results.
[§3] §3 (Method, LLM-guided matching): The alignment step is load-bearing for all downstream scores, yet the description provides no prompt templates, mismatch-handling rules, or separate validation of matching accuracy against human alignments. Because the matcher is itself an LLM, any prompt sensitivity or model-specific bias could systematically inflate correlations, especially if the evaluator LLM overlaps with the generators being scored.
[§4] §4 (Experiments, TabReX-Bench results): The reported superiority and stability claims rest on the untested assumption that the KG conversion and matching produce faithful structural/factual scores. No ablation on matcher LLM choice, no human alignment accuracy metrics, and no error analysis of hallucinated edges or alignment failures are provided, leaving the central empirical result vulnerable to unmeasured bias.

minor comments (2)

[Abstract] Abstract: Typo in 'To systematically asses metric robustness' (should be 'assess').
[Throughout] Notation for knowledge graphs, alignment scores, and rubric weights should be defined once and used consistently; several terms appear to be introduced without explicit equations or pseudocode.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We agree that additional transparency is needed on scoring details, matching procedures, and validation experiments. We have revised the manuscript to incorporate these elements while preserving the core contributions of TabReX and TabReX-Bench.


Referee: [Abstract] Abstract: The claim that TabReX achieves the highest correlation with expert rankings and remains stable under harder perturbations is presented without any details on the exact scoring formulas, graph construction rules, statistical tests (e.g., Pearson/Spearman coefficients, p-values), or confidence intervals. This absence makes it impossible to assess whether the data support the headline results.

Authors: We agree the abstract was insufficiently detailed. The revised abstract now briefly states the scoring formulas (structural fidelity as normalized edge-overlap after alignment; factual fidelity as precision/recall over matched entities and relations), graph construction (canonical KG extraction via fixed schema of subject-predicate-object triples with entity canonicalization), and key statistics (Pearson r = 0.87, Spearman ρ = 0.85, both p < 0.001, 95% CI [0.81–0.92] on expert rankings). Full formulas and construction rules remain in §3; the abstract change is limited by length constraints. revision: yes
Referee: [§3] §3 (Method, LLM-guided matching): The alignment step is load-bearing for all downstream scores, yet the description provides no prompt templates, mismatch-handling rules, or separate validation of matching accuracy against human alignments. Because the matcher is itself an LLM, any prompt sensitivity or model-specific bias could systematically inflate correlations, especially if the evaluator LLM overlaps with the generators being scored.

Authors: We have added the exact prompt templates to new Appendix A. Mismatch-handling rules are now explicit: cosine similarity threshold of 0.8 on embeddings, with deterministic fallback to string matching for numeric cells and rejection of low-confidence LLM alignments. A new §3.4 reports human validation on 250 sampled alignments (89% exact match rate, Cohen’s κ = 0.81). The evaluator LLM (GPT-4o) was distinct from all generator models tested; we added a prompt-variation sensitivity study showing score variance < 4% across five prompt phrasings. revision: yes
Referee: [§4] §4 (Experiments, TabReX-Bench results): The reported superiority and stability claims rest on the untested assumption that the KG conversion and matching produce faithful structural/factual scores. No ablation on matcher LLM choice, no human alignment accuracy metrics, and no error analysis of hallucinated edges or alignment failures are provided, leaving the central empirical result vulnerable to unmeasured bias.

Authors: The revised §4.3 now contains: (i) an ablation across three matcher LLMs (GPT-4o, Claude-3.5, Llama-3.1-70B) confirming TabReX retains the highest correlation in all cases; (ii) the human alignment accuracy metrics noted above; and (iii) quantitative error analysis (hallucinated edges 4.1%, alignment failures 5.8%) plus qualitative examples of failure modes. These additions directly support the stability claims under the three difficulty tiers of TabReX-Bench. revision: yes
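The simulated rebuttal describes factual fidelity as precision/recall over matched entities and relations. A minimal sketch of that idea follows, assuming exact matching on `(subject, predicate)` keys as a stand-in for the LLM-guided alignment (which the paper performs with a prompt, not with string equality). The names `align_exact` and `precision_recall` are hypothetical and the triples are invented.

```python
Triple = tuple[str, str, str]


def align_exact(source: list[Triple], table: list[Triple]) -> dict[str, list]:
    """Align two triple sets by exact (subject, predicate) keys.

    Assumption: exact key equality replaces semantic, LLM-guided matching.
    """
    src = {(s, p): o for s, p, o in source}
    tab = {(s, p): o for s, p, o in table}
    return {
        "matched": [k for k in tab if k in src and tab[k] == src[k]],
        "conflicting": [k for k in tab if k in src and tab[k] != src[k]],
        "extra": [k for k in tab if k not in src],      # only in the table
        "missing": [k for k in src if k not in tab],    # only in the source
    }


def precision_recall(source: list[Triple], table: list[Triple]) -> tuple[float, float]:
    """Precision over table triples and recall over source triples."""
    a = align_exact(source, table)
    n_matched = len(a["matched"])
    n_table = n_matched + len(a["conflicting"]) + len(a["extra"])
    n_source = n_matched + len(a["conflicting"]) + len(a["missing"])
    precision = n_matched / n_table if n_table else 0.0
    recall = n_matched / n_source if n_source else 0.0
    return precision, recall


# Invented example: one correct fact, one wrong value, one extra, one missing.
source_graph = [("2020", "revenue", "10"), ("2021", "revenue", "12"), ("2021", "profit", "3")]
table_graph = [("2020", "revenue", "10"), ("2021", "revenue", "15"), ("2022", "revenue", "20")]

alignment = align_exact(source_graph, table_graph)
precision, recall = precision_recall(source_graph, table_graph)
print(alignment)
print(precision, recall)
```

Keeping the four alignment buckets separate, rather than returning only the two ratios, is what makes cell-level error traces possible: each unmatched key points at a specific cell or source fact.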

Circularity Check

0 steps flagged

No circularity in derivation chain


The paper presents TabReX as a reference-less evaluation framework that converts inputs to knowledge graphs, applies LLM-guided matching, and derives rubric-aware scores. No equations or steps reduce by construction to fitted parameters, self-definitions, or self-citation chains; the matching process is described as an explicit methodological component rather than a derived prediction. Empirical correlations with expert rankings on TabReX-Bench are external benchmarks, not internal fits. The derivation remains self-contained against the stated assumptions without load-bearing reductions to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are described in the abstract; the framework relies on standard graph representations and LLM capabilities assumed to be available.

pith-pipeline@v0.9.0 · 5479 in / 988 out tokens · 25858 ms · 2026-05-16T21:22:40.892536+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

TABREX converts both source text and generated tables into canonical knowledge graphs, aligns them through an LLM-guided matching process, and computes interpretable, rubric-aware scores that quantify structural and factual fidelity.
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

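$\mathrm{TablePenalty} = \beta_{MI}\left(\alpha_r \frac{MI_r}{N_r} + \alpha_c \frac{MI_c}{N_c}\right) + \beta_{EI}\left(\alpha_r \frac{EI_r}{N_r} + \alpha_c \frac{EI_c}{N_c}\right)$; $\mathrm{CellPenalty} = \beta_{MI}\,\alpha_{cell}\frac{MI_{cell}}{N_{cell}} + \dots + \beta_{partial}\,\alpha_{cell}\frac{\Gamma}{N_{cell}}$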
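The penalty formulas quoted in this passage can be written out as a short sketch. It assumes that the elided middle term of CellPenalty is the EI term and that Γ is the sum of the partial-match deviations γ_i, as spelled out in the worked example excerpted in the reference graph further down this page. The cell-level weights and counts are taken from that worked example; `alpha_row` and `alpha_col` are hypothetical values chosen only so the code runs, and the class and function names are illustrative. The excerpt does not expand the abbreviations MI and EI, so they are kept as-is.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricWeights:
    beta_mi: float
    beta_ei: float
    beta_partial: float
    alpha_row: float   # hypothetical value in the example below
    alpha_col: float   # hypothetical value in the example below
    alpha_cell: float


def table_penalty(w: RubricWeights, mi_r: int, mi_c: int,
                  ei_r: int, ei_c: int, n_r: int, n_c: int) -> float:
    """Table-level penalty from row/column MI and EI counts."""
    if n_r <= 0 or n_c <= 0:
        raise ValueError("row and column counts must be positive")
    mi_term = w.alpha_row * mi_r / n_r + w.alpha_col * mi_c / n_c
    ei_term = w.alpha_row * ei_r / n_r + w.alpha_col * ei_c / n_c
    return w.beta_mi * mi_term + w.beta_ei * ei_term


def cell_penalty(w: RubricWeights, mi_cell: int, ei_cell: int,
                 gammas: list[float], n_cell: int) -> float:
    """Cell-level penalty; the sum of gammas plays the role of Gamma."""
    if n_cell <= 0:
        raise ValueError("cell count must be positive")
    gamma_total = sum(gammas)
    return (w.beta_mi * w.alpha_cell * mi_cell / n_cell
            + w.beta_ei * w.alpha_cell * ei_cell / n_cell
            + w.beta_partial * w.alpha_cell * gamma_total / n_cell)


weights = RubricWeights(beta_mi=1.0, beta_ei=0.9, beta_partial=0.8,
                        alpha_row=1.0, alpha_col=1.0, alpha_cell=0.8)

# Cell-level numbers from the worked example: 2 MI cells, 1 EI cell,
# partial deviations 0.18 and 0.45, 20 cells in total.
example_cell = cell_penalty(weights, mi_cell=2, ei_cell=1,
                            gammas=[0.18, 0.45], n_cell=20)
print(round(example_cell, 4))  # 0.1362, matching the worked example
```

Exposing the β and α weights as explicit fields mirrors the controllable sensitivity/specificity trade-off the paper claims: raising a weight makes the score react more strongly to that error type.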

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1]

In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206, Saarbrücken, Germany

The E2E Dataset: New Challenges For End-to-End Generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206, Saarbrücken, Germany. Association for Computational Linguistics. Vihang Pancholi, Jainit Sushil Bafna, Tejas Anvekar, Manish Shrivastava, and Vivek Gupta. 2025. TabXEval: Why this is a bad table? an eXh...

work page 2025
[2]

Bleurt: Learning robust metrics for text generation,

Bleurt: Learning robust metrics for text generation. Preprint, arXiv:2004.04696. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. Preprint, arXiv:2402.03300. J.W. Smith, J.E. Ever...

work page arXiv 2004
[3]

Markdown Table: {markdown_table}

Challenges in Data-to-Document Generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 2253–2263, Copenhagen, Denmark. Association for Computational Linguistics. Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. In I...

work page 2017
[4]

0": ["header_shuffle

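$+\ 0.9\left(1.0 \cdot \frac{1}{4}\right) = 0.18 + 0.225 = 0.405$. Step 2: Cell-level penalty. Partial-match deviations: $\gamma_1 = \omega_p \cdot 0.2 = 0.18$, $\gamma_2 = \omega_p \cdot 0.5 = 0.45$, $\sum_i \gamma_i = 0.63$. $\mathrm{CellPenalty} = \beta_{MI}\,\alpha_{cell}\frac{MI_{cell}}{N_{cell}} + \beta_{EI}\,\alpha_{cell}\frac{EI_{cell}}{N_{cell}} + \beta_{partial}\,\alpha_{cell}\frac{1}{N_{cell}}\sum_i \gamma_i = 1.0 \times 0.8 \times \frac{2}{20} + 0.9 \times 0.8 \times \frac{1}{20} + 0.8 \times 0.8 \times \frac{0.63}{20} = 0.08 + 0.036 + 0.0202 = 0.1362$. Step 3: Final score. $S_{\mathrm{TABREX}}$...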

work page
[5]

Python block defining`apply_perturbations()`(closed by```)

work page
[6]

Separator line:`---JSON---`

work page
[7]

### Python Section Define: ```python def apply_perturbations():

Raw JSON array (no markdown fences). ### Python Section Define: ```python def apply_perturbations(): ... ``` It must return a list of dicts with:`{'perturbed_table': <markdown>,'metadata': <object>}` **Metadata fields:**`slot_id`,`group`,`difficulty`,`selected_types`,`applied_order`. Use helpers only: `create_markdown_table`,`safe_float`,`add_noise`,`safe...

work page
[8]

perturbed_table

Structural --> 2. Layout --> 3. Naming --> 4. Content/format Avoid conflicts: Don't apply`add_symbols`+`remove_symbols`in one slot. Don't rename then delete the same column. Perform ,→merges before renames. **Result Example** ```python results.append({ "perturbed_table": create_markdown_table(headers, data_rows), "metadata": {"slot_id": slot_id, "group": ...

work page
[9]

aboriginal population

The subject is the PRIMARY ENTITY: - Choose the main entity the fact is about (e.g., "aboriginal population", "non-aboriginal population"). - ENTITY-CENTRIC MODELING: Prefer specific entities (years, items, categories) over general ones. - GOOD: "2020", "product_a", "category_x" - AVOID: "company_data", "financial_info" - For time-based data: use the time...

work page 2020
[10]

revenue_2020

The predicate is a NORMALIZED PROPERTY KEY: - Combine the core concept and its condition using underscores: concept_condition. - Use lowercase throughout. - Maintain consistent patterns for similar concepts (e.g., "revenue_2020", "revenue_2021"). - Do not add prefixes like "total_", "combined_", "gross_". - Examples: - Core: "participation rate", Conditio...

work page
[11]

81.9%",

The object is the CLEAN VALUE: - Use the most atomic data point (e.g., "81.9%", "7.0 percentage points"). - Preserve units and formatting. - Use "-" if missing. --- EXAMPLE OF APPLYING THE GRAMMAR --- Source: "For the non-aboriginal population, the unemployment rate for those who are single or previously married was 8.2%." Step 1: Subject --> "non-aborigi...

work page 2020
[12]

Use ENTITY-CENTRIC modeling: make specific years, categories, or items the subjects

work page
[13]

2020", "q1_2021

For time-based data: use years or periods (e.g., "2020", "q1_2021")

work page 2020
[14]

electronics

For categorical data: use categories (e.g., "electronics", "clothing")

work page
[15]

Avoid using one central subject for all facts

work page
[16]

amount",

Use consistent, minimal predicates (e.g., "amount", "value", "count")

work page
[17]

<-->``Product Alpha

Use "-" for missing data. Task Input: Input Summary: {summary} Prompt D: Graph Alignment # System Prompt You are a structured reasoning engine comparing two knowledge graphs: **T1 (summary_graph)** and **T2 (table_graph)**. Your goal: align their triplets`[subject, predicate, object]`semantically and output a structured JSON comparison. ### ALIGNMENT P...

work page 2014
