Tree-of-Text: A Tree-based Prompting Framework for Table-to-Text Generation in the Sports Domain

An-Zi Yen; Shang-Hsuan Chiang; Tsan-Tsung Yang; Wen-Chih Peng

arXiv: 2604.26501 · v1 · submitted 2026-04-29 · 💻 cs.CL · cs.AI · cs.HC

Pith reviewed 2026-05-07 11:15 UTC · model grok-4.3

keywords: table-to-text generation · prompting framework · sports domain · large language models · content planning · operation execution · hallucination reduction

The pith

Tree-of-Text uses a three-stage tree-structured prompt to guide LLMs in turning sports data tables into accurate narrative reports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that structuring LLM prompts as a tree with explicit content planning, operation execution on sub-tables, and final merging produces more faithful sports game summaries than flat or chain-based prompting. It targets the core difficulty that LLMs often hallucinate or miss key facts when interpreting large structured tables directly. A sympathetic reader would care because reliable automatic report generation from box scores or play logs could scale sports coverage without large annotated training sets. If the approach holds, it shows that decomposing table handling into selectable operations and incremental text pieces can measurably raise relevance, content selection, and coherence scores. The experiments further indicate that these gains arrive at roughly 40 percent of the time and cost of Chain-of-Table, the closest competing method.

Core claim

Tree-of-Text decomposes table-to-text generation into a Content Planning stage that selects operations and arguments from the input tables, an Operation Execution stage that splits large tables into smaller sub-tables and runs the chosen operations, and a Content Generation stage that merges and rewrites the resulting short texts into a single coherent report. On ShuttleSet+ the method outperforms prior prompting baselines; on RotoWire-FG it leads in RG and CO metrics; on MLB it leads in CS and CO metrics, all while consuming approximately 40 percent of the time and cost of Chain-of-Table.
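To make the three stages concrete, here is a minimal Python sketch of a one-level tree. It is an illustration under stated assumptions, not the authors' implementation: the LLM calls for planning, write(), and merging are replaced by deterministic stand-ins, and the names `plan`, `write`, `merge`, `tree_of_text`, and the toy table are hypothetical.

```python
from typing import Callable, Dict, List

Table = List[Dict[str, object]]  # one dict per row


def select_col(table: Table, cols: List[str]) -> Table:
    """Keep only the named columns."""
    return [{c: row[c] for c in cols} for row in table]


def sort_rows(table: Table, col: str) -> Table:
    """Sort rows by one column, largest first."""
    return sorted(table, key=lambda row: row[col], reverse=True)


def plan(table: Table) -> List[Callable[[Table], Table]]:
    """Stage 1 stand-in: a fixed plan. A real planner would prompt an LLM
    with the table and choose operations and arguments from a pool."""
    return [
        lambda t: select_col(t, ["player", "points"]),
        lambda t: sort_rows(select_col(t, ["player", "errors"]), "errors"),
    ]


def write(sub_table: Table) -> str:
    """Stage 2 leaf stand-in: template text instead of an LLM write() call."""
    return "; ".join(
        ", ".join(f"{k}={v}" for k, v in row.items()) for row in sub_table
    )


def merge(texts: List[str]) -> str:
    """Stage 3 stand-in: join in order instead of an LLM merge-and-rewrite."""
    return " | ".join(texts)


def tree_of_text(table: Table) -> str:
    branches = plan(table)                               # Content Planning
    short_texts = [write(op(table)) for op in branches]  # Operation Execution
    return merge(short_texts)                            # Content Generation


table = [
    {"player": "A", "points": 21, "errors": 4},
    {"player": "B", "points": 17, "errors": 9},
]
report = tree_of_text(table)
```

Each branch sees only its own sub-table, which is the property the framework relies on; a real system would swap the three stand-ins for prompted LLM calls and could recurse to deeper levels.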

What carries the argument

The tree-structured prompting framework that routes the LLM through content planning to choose operations, sub-table execution of those operations, and final merging of short textual outputs.

If this is right

The framework yields higher relevance and content-selection scores than existing prompt-based baselines on ShuttleSet+.
It records the best RG and CO results among compared methods on RotoWire-FG.
It records the best CS and CO results among compared methods on MLB.
It achieves the reported metric gains at roughly 40 percent of the time and cost of Chain-of-Table.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same planning-plus-sub-table pattern could be adapted to other table-to-text domains such as financial summaries or medical records by changing only the allowed operation vocabulary.
Because the method produces intermediate short texts, it creates natural checkpoints where human editors could intervene without restarting the entire generation.
The cost reduction suggests the approach may become practical for real-time or high-volume deployment where full chain-of-thought prompting would be too expensive.
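The first extension, swapping only the operation vocabulary, can be pictured as an operation pool keyed by name. The sketch below is hypothetical: `select_row`, `select_col`, and `filter` are operation names that appear in the paper's figures, but their signatures here are assumed, and `top_k`, `FINANCE_POOL`, and `run` are invented purely for illustration.

```python
from typing import Callable, Dict, List

Table = List[Dict[str, object]]


def select_row(table: Table, indices: List[int]) -> Table:
    """Keep the rows at the given positions."""
    return [table[i] for i in indices]


def select_col(table: Table, cols: List[str]) -> Table:
    """Keep only the named columns."""
    return [{c: row[c] for c in cols} for row in table]


def filter_rows(table: Table, col: str, value: object) -> Table:
    """Keep rows whose column equals the given value."""
    return [row for row in table if row[col] == value]


# Pool for the sports setting (assumed signatures).
SPORTS_POOL: Dict[str, Callable[..., Table]] = {
    "select_row": select_row,
    "select_col": select_col,
    "filter": filter_rows,
}


def top_k(table: Table, col: str, k: int) -> Table:
    """Hypothetical extra operation for another domain."""
    return sorted(table, key=lambda row: row[col], reverse=True)[:k]


# A different domain reuses the pool and adds its own operations.
FINANCE_POOL: Dict[str, Callable[..., Table]] = {**SPORTS_POOL, "top_k": top_k}


def run(pool: Dict[str, Callable[..., Table]], name: str, table: Table, *args) -> Table:
    """Execute one planned operation, rejecting names outside the pool."""
    if name not in pool:
        raise KeyError(f"operation {name!r} is not in the pool")
    return pool[name](table, *args)


ledger = [{"firm": "X", "revenue": 5}, {"firm": "Y", "revenue": 9}]
largest = run(FINANCE_POOL, "top_k", ledger, "revenue", 1)
```

Under this design the planning, execution, and merging code stays the same across domains; only the dictionary passed to `run` changes.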

Load-bearing premise

The LLM will reliably choose the correct operations and arguments during content planning and will execute them on the sub-tables without introducing factual errors or hallucinations when the input tables are large or noisy.
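One cheap guard on this premise is to check a plan's form before executing it. The paper's Content Planning prompt asks for output of the form `[operation_1(argument_1, ...), operation_2(argument_2, ...), ...]`, restricted to an operation pool and capped by `MAX_DEGREE`. The sketch below parses and validates that format; the parser, the function names, and the example plan are assumptions for illustration, and the check covers only form, not whether the chosen operations are the right ones.

```python
import re
from typing import List, Set, Tuple

# Matches name(arg, arg, ...) with no nested parentheses.
CALL = re.compile(r"(\w+)\s*\(([^()]*)\)")


def parse_plan(raw: str) -> List[Tuple[str, List[str]]]:
    """Parse '[op(a, b), op2(c)]' into [(op, [a, b]), (op2, [c])]."""
    text = raw.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("plan must be wrapped in [ ]")
    calls = []
    for name, args in CALL.findall(text[1:-1]):
        calls.append((name, [a.strip() for a in args.split(",") if a.strip()]))
    return calls


def validate_plan(
    calls: List[Tuple[str, List[str]]], pool: Set[str], max_degree: int
) -> List[str]:
    """Return human-readable problems; an empty list means the plan passes."""
    problems = []
    if len(calls) > max_degree:
        problems.append(f"{len(calls)} operations exceed MAX_DEGREE={max_degree}")
    for name, _ in calls:
        if name not in pool:
            problems.append(f"unknown operation: {name}")
    return problems


# Operation names as they appear in the paper's usage-rate figure.
pool = {"select_table", "select_row", "select_col", "count", "sort", "filter", "write"}

# 'summarize' is a made-up operation to show a rejected plan.
plan = parse_plan("[ select_col ( player , points ) , summarize ( all ) ]")
problems = validate_plan(plan, pool, max_degree=3)
```

A plan that passes these checks can still select the wrong rows or columns, so this complements rather than replaces the planning-accuracy evaluation the premise calls for.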

What would settle it

Apply Tree-of-Text to a fresh collection of large, noisy sports tables drawn from a different league or season and measure the factual error rate of the generated reports against the original table values; a sharp rise in unsupported facts would falsify the claim of reliable execution.
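A rough version of this test can be scripted. The sketch below is a crude number-matching proxy, assumed here for illustration only: it flags any number in a report that does not appear as a numeric cell in the table. It is not the paper's LLM-based information-extraction evaluation, and `unsupported_rate` and the toy data are hypothetical.

```python
import re
from typing import Dict, List, Set

Table = List[Dict[str, object]]


def table_numbers(table: Table) -> Set[float]:
    """Collect every numeric cell value as a float."""
    return {
        float(v)
        for row in table
        for v in row.values()
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    }


def unsupported_rate(report: str, table: Table) -> float:
    """Fraction of numbers in the report that match no table cell."""
    mentioned = [float(x) for x in re.findall(r"\d+(?:\.\d+)?", report)]
    if not mentioned:
        return 0.0
    supported = table_numbers(table)
    bad = sum(1 for x in mentioned if x not in supported)
    return bad / len(mentioned)


table = [{"player": "A", "points": 21}, {"player": "B", "points": 17}]

# '19' does not occur in the table, so half of the mentioned numbers fail.
rate = unsupported_rate("A won 21 to 19 against B.", table)
```

The proxy ignores which entity a number is attached to, so it undercounts errors such as swapped scores; a relation-level check of the form (table | column | value) would be stricter.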

Figures

Figures reproduced from arXiv: 2604.26501 by An-Zi Yen, Shang-Hsuan Chiang, Tsan-Tsung Yang, Wen-Chih Peng.

**Figure 1.** An example from ShuttleSet+, which includes multiple structured tables containing match data along with …

**Figure 2.** The overall framework of Tree-of-Text, which constructs a tree structure and divides the task into three …

**Figure 3.** The detailed workflow of Content Planning, Operation Execution, and Content Generating.

**Figure 4.** Prompt for Content Planning.

**Figure 5.** Prompt for write() operation.

**Figure 6.** Prompt for Content Generating.

**Figure 7.** Prompt for LLM-based IE model.

**Figure 8.** Usage rate of operations at each level, where the horizontal axis indicates the usage rate, the vertical axis … (plot legend: root, select_table, select_row, select_col, count, sort, filter, write; depths 1 to 5)

**Figure 9.** The qualitative results of Human, Chain-of-Table, and Tree-of-Text outputs.

Original abstract

Generating sports game reports from structured tables is a complex table-to-text task that demands both precise data interpretation and fluent narrative generation. Traditional model-based approaches require large, annotated datasets, while prompt-based methods using large language models (LLMs) often struggle with hallucination due to weak table comprehension. To overcome these challenges, we propose Tree-of-Text, a tree-structured prompting framework that guides LLMs through a three-stage generation process: (1) Content Planning, where relevant operations and arguments are selected from the input tables; (2) Operation Execution, which breaks down large tables into manageable sub-tables; and (3) Content Generation, where short textual outputs are merged and rewritten into a cohesive report. Experiments show that our method outperforms existing methods on ShuttleSet+, leads in RG and CO metrics on RotoWire-FG, and excels in CS and CO on MLB with roughly 40% of the time and cost of Chain-of-Table. These results demonstrate the effectiveness and efficiency of Tree-of-Text and suggest a promising direction for prompt-based table-to-text generation in the sports domain.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Tree-of-Text introduces a three-stage tree prompting structure for sports table-to-text that claims metric leads at roughly 40% of the time and cost of Chain-of-Table, but provides no direct checks on planning accuracy.

The paper puts forward Tree-of-Text as a way to generate text from sports tables using a tree of prompts. It has the model plan by selecting operations and arguments, execute on split sub-tables, and then merge the results into a report. This structure is presented as an improvement over standard chain prompting and the Chain-of-Table method.

The new element is the explicit decomposition into those three stages with a tree organization. It aims to keep the model focused on smaller pieces of the data at each step, which could help with the wide tables common in sports stats.

The paper shows results on ShuttleSet+, RotoWire-FG, and MLB. It says the method outperforms existing ones on the first, leads in RG and CO on the second, and in CS and CO on the third. It also uses roughly 40% of the time and cost of Chain-of-Table. If the numbers are solid, this efficiency is a real plus for repeated use.

The concern is that the planning stage has no reported accuracy or error analysis. The model has to correctly identify what operations to run and on which parts of the table. Sports data can be noisy, and any mistake there carries through to the final text. The results do not include ablations that isolate the tree from other prompt changes or measure how often the sub-tables match human expectations.

This work is useful for people who build LLM systems for domain-specific data-to-text, particularly in sports or other table-heavy areas. It provides a practical recipe rather than a new model or theory. The approach shows clear engagement with the problem of table comprehension in prompting. It deserves peer review so the experimental setup can be examined in full and the authors can add the needed checks on each stage. I recommend sending it to referees.

Referee Report

2 major / 2 minor

Summary. The paper proposes Tree-of-Text, a tree-structured prompting framework for table-to-text generation in the sports domain. The approach decomposes the task into three stages: (1) Content Planning, in which the LLM selects relevant operations and arguments from the input tables; (2) Operation Execution, which decomposes large tables into manageable sub-tables; and (3) Content Generation, in which short textual outputs are merged and rewritten into a cohesive report. The reported experiments show gains over existing methods on ShuttleSet+, the leading RG and CO scores on RotoWire-FG, and the leading CS and CO scores on MLB, all at roughly 40% of the time and cost of Chain-of-Table.

Significance. If the reported gains prove robust, the work would be significant for prompt-based table-to-text generation by providing a structured decomposition that targets hallucination while delivering measurable efficiency gains. The tree-based extension of chain-of-thought ideas is a natural fit for wide, noisy sports tables and could generalize to other structured-data domains. No machine-checked proofs or parameter-free derivations are present, but the efficiency comparison supplies a concrete, falsifiable practical claim.

major comments (2)

§4 (Experimental Results): The central outperformance and efficiency claims rest on the Content Planning stage reliably selecting correct operations and arguments from large or noisy sports tables. No accuracy metrics for operation/argument selection, no human evaluation of the generated plans, and no error analysis of planning failures are provided. Because any planning error produces incorrect sub-tables whose outputs are merged in stage 3, this omission directly undermines the reported gains on ShuttleSet+, RotoWire-FG, and MLB.
§4.2 (Ablations): No ablation isolates the contribution of the tree structure itself from generic multi-stage prompting or from the particular choice of operations. Without such controls it is impossible to attribute the metric improvements (RG, CO, CS) to the proposed framework rather than prompt engineering or dataset-specific factors.

minor comments (2)

[Abstract] The abstract states 'roughly 40% of the time and cost'; the main text or appendix should supply exact wall-clock times, token counts, API costs, and any variance across runs to support reproducibility.
[§3 (Method)] A concrete worked example showing one input table, the selected operations/arguments, the resulting sub-tables, and the final merged text would substantially improve clarity of the three-stage pipeline.
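The efficiency figures requested in the first minor comment could be gathered with a small harness like the one below. This is a sketch under stated assumptions: `measure`, `count_tokens`, and `fake_method` are hypothetical names, whitespace splitting stands in for a real tokenizer, and a deployed comparison would call the actual Tree-of-Text and Chain-of-Table pipelines and read token counts from the API response.

```python
import statistics
import time
from typing import Callable, Dict


def count_tokens(text: str) -> int:
    """Whitespace split as a stand-in for a real tokenizer."""
    return len(text.split())


def measure(run: Callable[[str], str], prompt: str, repeats: int = 3) -> Dict[str, float]:
    """Run one method several times and summarize time and token usage."""
    seconds, tokens = [], []
    for _ in range(repeats):
        start = time.perf_counter()
        output = run(prompt)
        seconds.append(time.perf_counter() - start)
        tokens.append(count_tokens(prompt) + count_tokens(output))
    return {
        "mean_seconds": statistics.mean(seconds),
        # Sample standard deviation needs at least two runs.
        "stdev_seconds": statistics.stdev(seconds) if repeats > 1 else 0.0,
        "mean_tokens": statistics.mean(tokens),
    }


def fake_method(prompt: str) -> str:
    """Placeholder for a full generation pipeline."""
    return "A beat B 21 to 17"


stats = measure(fake_method, "summarize the match table")
```

Dividing one method's `mean_seconds` and `mean_tokens` by the other's would give the relative time and cost ratio, with `stdev_seconds` indicating how stable that ratio is across runs.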

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight valuable opportunities to strengthen the experimental analysis, and we address each point below with proposed revisions.

Referee: §4 (Experimental Results): The central outperformance and efficiency claims rest on the Content Planning stage reliably selecting correct operations and arguments from large or noisy sports tables. No accuracy metrics for operation/argument selection, no human evaluation of the generated plans, and no error analysis of planning failures are provided. Because any planning error produces incorrect sub-tables whose outputs are merged in stage 3, this omission directly undermines the reported gains on ShuttleSet+, RotoWire-FG, and MLB.

Authors: We agree that direct assessment of the Content Planning stage is necessary to substantiate the end-to-end results and rule out error propagation. In the revised manuscript we will add accuracy metrics for operation and argument selection on sampled instances from each dataset, human evaluation of plan correctness, and a categorized error analysis of planning failures together with their downstream effects on the final reports. These additions will directly address the concern.

Revision: yes

Referee: §4.2 (Ablations): No ablation isolates the contribution of the tree structure itself from generic multi-stage prompting or from the particular choice of operations. Without such controls it is impossible to attribute the metric improvements (RG, CO, CS) to the proposed framework rather than prompt engineering or dataset-specific factors.

Authors: The current ablations compare against Chain-of-Table (a linear multi-stage baseline) and vary the set of operations. To more explicitly isolate the tree structure, we will add a new ablation in the revision that contrasts Tree-of-Text against a flat (non-hierarchical) multi-stage prompting variant using identical stages and operations. This control will help attribute gains specifically to the tree-based decomposition.

Revision: yes

Circularity Check

0 steps flagged

No circularity: prompting recipe with external evaluation

The paper describes a three-stage prompting framework (Content Planning, Operation Execution, Content Generation) for table-to-text generation. No equations, fitted parameters, derivations, or mathematical claims are present that could reduce to inputs by construction. Evaluation relies on external benchmarks (ShuttleSet+, RotoWire-FG, MLB) and comparison to Chain-of-Table, with no self-citation load-bearing the central method. The approach is a self-contained prompting recipe rather than a derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLMs can be guided by explicit stage prompts to avoid hallucination on tabular input; no free parameters, invented entities, or non-standard axioms are stated.

axioms (1)

Domain assumption: Large language models can follow multi-stage instructions to select and execute table operations without introducing unsupported facts.
Invoked implicitly in the description of the Content Planning and Operation Execution stages.

pith-pipeline@v0.9.0 · 5513 in / 1128 out tokens · 40572 ms · 2026-05-07T11:15:29.438588+00:00 · methodology
