Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration

Beiwen Zhang; Guowei Zou; Haitao Wang; Hejun Wu; Yongheng Liang

arxiv: 2606.08596 · v1 · pith:SJUCWKO5new · submitted 2026-06-07 · 💻 cs.AI · cs.HC


Pith reviewed 2026-06-27 18:49 UTC · model grok-4.3

classification 💻 cs.AI cs.HC

keywords policy tree · LLM distillation · human-AI collaboration · interpretable policies · Overcooked-AI · multi-agent systems · natural language feedback · closed-loop refinement


The pith

Co-pi-tree distills LLM reasoning into an executable policy tree refined by natural-language interaction feedback for more efficient human-AI collaboration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that distilling large-language-model reasoning into an explicit policy tree, then closing the loop with natural-language summaries of interaction outcomes, produces policies that are both interpretable and far cheaper to run than querying an LLM at every step. The tree splits into two parts: one that predicts partner behavior and one that selects the agent's own actions. Experiments in the Overcooked-AI environment report a 35.4 percent gain in average reward over the baseline average, together with 77.7 percent fewer LLM queries and 97.1 percent lower test-time latency. The method therefore aims to combine the reasoning power of LLMs with the speed and transparency of an explicit, editable decision structure.
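To make the efficiency figures concrete: a relative reduction of r percent leaves a fraction 1 - r/100 of the original cost, which corresponds to a factor of 1 / (1 - r/100). The sketch below only performs this arithmetic on the percentages quoted above. The helper name `reduction_to_factor` is ours, and no absolute latencies or query counts from the paper are assumed.

```python
def reduction_to_factor(reduction_pct: float) -> float:
    """Convert a percentage reduction into an 'N times less' factor."""
    remaining = 1.0 - reduction_pct / 100.0  # fraction of the original cost left
    if remaining <= 0.0:
        raise ValueError("reduction must be below 100%")
    return 1.0 / remaining


# Percentages as reported in the abstract.
query_factor = reduction_to_factor(77.7)    # roughly 4.5x fewer LLM queries
latency_factor = reduction_to_factor(97.1)  # roughly 34x lower test-time latency
```

Read this way, the latency claim is the stronger of the two: a 97.1 percent reduction is more than an order of magnitude, while the query reduction is closer to a factor of four and a half.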

Core claim

Co-pi-tree constructs an executable policy by first distilling LLM reasoning into code that defines a partner-behavior prediction tree and an agent-action selection tree; the policy is then run in interaction with a human partner, after which natural-language summaries of the observed feedback are used to locate and rewrite problematic branches, yielding a closed-loop improvement process that requires no further LLM queries at execution time.

What carries the argument

Collaboration Policy Tree (Co-pi-tree), an executable structure containing a partner-behavior prediction tree and an agent-action selection tree that is initially distilled from LLM reasoning and subsequently edited via natural-language summaries of interaction feedback.
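As a rough illustration of what such a two-part structure could look like once it is plain code, here is a minimal sketch. It is not the authors' tree: the `Scene` fields, the action names, and every branch condition are hypothetical, chosen only to show a prediction tree feeding an action-selection tree with no LLM call at execution time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scene:
    """Hypothetical, heavily simplified kitchen state."""
    self_holding: str     # "empty", "onion", "dish", or "soup"
    partner_holding: str  # same value set as self_holding
    pot_onions: int       # 0..3
    soup_ready: bool


def predict_partner(scene: Scene) -> str:
    """Partner-behavior prediction tree: guess the partner's next move."""
    if scene.partner_holding == "soup":
        return "deliver_soup"
    if scene.partner_holding == "dish":
        return "plate_soup" if scene.soup_ready else "wait_at_pot"
    if scene.partner_holding == "onion":
        return "add_onion"
    return "fetch_onion" if scene.pot_onions < 3 else "fetch_dish"


def select_action(scene: Scene, partner_intent: str) -> str:
    """Agent-action selection tree: pick an action that complements the partner."""
    if scene.self_holding == "soup":
        return "deliver_soup"
    if scene.self_holding == "onion":
        return "add_onion" if scene.pot_onions < 3 else "drop_item"
    if scene.self_holding == "dish":
        return "plate_soup" if scene.soup_ready else "wait_at_pot"
    # Empty hands: avoid duplicating the role the partner is predicted to take.
    if partner_intent in ("fetch_dish", "plate_soup", "wait_at_pot"):
        return "fetch_onion" if scene.pot_onions < 3 else "wait_at_pot"
    return "fetch_dish" if scene.pot_onions >= 2 else "fetch_onion"


def policy(scene: Scene) -> str:
    """Run both trees in sequence; plain branching only, no model query."""
    return select_action(scene, predict_partner(scene))
```

Because each branch is an ordinary `if`, a refinement step (or a human) can change one condition without touching the rest, which is the property the inspectability argument relies on.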

If this is right

Once distilled, the policy tree executes without any LLM calls, directly cutting latency and cost at deployment.
Because the tree is explicit code, a human can inspect or manually edit any branch rather than retraining an opaque model.
The same distillation-plus-refinement loop can be applied to any multi-agent task where partner behavior is partially predictable from language descriptions.
Fewer LLM queries during testing translate into lower monetary cost and lower carbon cost per collaboration episode.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may generalize to domains outside games if the partner-behavior tree can be seeded with domain-specific language descriptions of typical human actions.
If the refinement loop succeeds, it offers a practical route to keeping LLM-level reasoning inside a system that must run on low-power edge hardware.
The separation into prediction and action trees suggests a natural place to insert human overrides: a supervisor could edit only the action tree while leaving the learned partner model untouched.

Load-bearing premise

Natural language summaries of interaction feedback can reliably locate and correct the specific branches that are causing poor performance without introducing new errors.

What would settle it

A controlled trial in Overcooked-AI in which the natural-language refinement step is applied to the distilled tree yet average reward does not rise or LLM-query count does not fall relative to the unrefined tree.

Figures

Figures reproduced from arXiv: 2606.08596 by Beiwen Zhang, Guowei Zou, Haitao Wang, Hejun Wu, Yongheng Liang.

**Figure 2.** Overview of the Co-π-tree pipeline.

… 2024), HLA uses natural-language communication for command interpretation and macro-action generation (Liu et al., 2024a), and CausalPlan augments LLM action selection with causal scores over candidate actions (Nguyen et al., 2026). These methods demonstrate the potential of language-based reasoning for coordination. However, they often rely on LLM-driven decisions dur…

**Figure 3.** Box plots of human-agent collaboration rewards by layout. Each box shows per-volunteer average rewards …

**Figure 4.** Comparison among ProAgent, CausalPlan, and Co-π-tree in terms of NQ and test-time latency.

… especially clear in CT. Circuit and Asymm. Adv., where successful collaboration requires stable role specialization. The main exception is Forced Coord., where Co-π-tree is competitive as Player 0 but weaker as Player 1. We provide a brief discussion in Appendix A.6. Beyond reward, …

**Figure 5.** Overcooked-AI layouts used in our evaluation.

**Figure 6.** Reward growth over iterative refinement across five Overcooked-AI layouts. Each panel shows one …

**Figure 7.** Cross-layout transfer performance for SP, ProAgent, and Co-…

Original abstract

Constructing efficient and reliable policies to assist humans is indispensable for human-AI collaboration. Existing methods mainly follow two lines of work. Most prior work relies on multi-agent reinforcement learning (MARL) to learn black-box policies, which limits interpretability and raises safety concerns. Recent methods query large language models (LLMs) at each decision step, causing slow responses and high inference costs. We propose Collaboration Policy Tree (Co-pi-tree), a closed-loop method that learns an executable policy tree consisting of a partner-behavior prediction tree and an agent-action selection tree. Co-pi-tree constructs a policy by distilling LLM reasoning into policy tree code. It then evaluates the policy through partner interaction, obtains feedback, and uses natural language to summarize the interaction feedback to improve problematic branches. Experiments in Overcooked-AI show that Co-pi-tree improves average reward by 35.4% over the baseline average, while reducing the number of LLM queries by 77.7% and test-time latency by 97.1%. Project page: https://beiwenzhang.github.io/Co-pi-tree/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Co-pi-tree distills LLM reasoning into a dual policy tree with NL feedback refinement for Overcooked, but the abstract gives no mechanism or ablation details to back the 35% reward and efficiency claims.


The paper's main contribution is a pipeline that first distills an LLM's step-by-step reasoning into executable code for two linked trees: one predicting partner behavior and one selecting the agent's action. It then runs the tree in interaction, summarizes the results in natural language, and edits specific branches that underperform. This is presented as a middle path between opaque MARL policies and expensive per-step LLM queries.
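The run, summarize, edit cycle described here can be sketched as a small loop. Everything below is an assumption made for illustration: the `Tree` type is a stand-in dictionary rather than real policy code, the three callables are toy placeholders for interaction, natural-language summarization, and branch editing, and the rule of keeping an edit only when reward improves is our own safeguard, not something the abstract states.

```python
from typing import Callable, Dict, List, Tuple

Tree = Dict[str, str]  # branch name -> action; stand-in for policy tree code


def refine(
    tree: Tree,
    evaluate: Callable[[Tree], Tuple[float, List[str]]],
    summarize: Callable[[List[str]], str],
    edit: Callable[[Tree, str], Tree],
    rounds: int,
) -> Tuple[Tree, float]:
    """Repeat evaluate -> summarize -> edit, keeping the best tree seen."""
    best_tree, (best_reward, log) = tree, evaluate(tree)
    for _ in range(rounds):
        summary = summarize(log)              # natural-language step in the paper
        candidate = edit(best_tree, summary)  # local change to one branch
        reward, new_log = evaluate(candidate)
        if reward > best_reward:              # assumed safeguard: reject harmful edits
            best_tree, best_reward, log = candidate, reward, new_log
    return best_tree, best_reward


# Toy stand-ins so the loop runs end to end (all hypothetical).
TARGET = {"pot_full": "fetch_dish", "soup_ready": "plate_soup"}


def toy_evaluate(tree: Tree) -> Tuple[float, List[str]]:
    wrong = [b for b in TARGET if tree.get(b) != TARGET[b]]
    return float(len(TARGET) - len(wrong)), wrong


def toy_summarize(log: List[str]) -> str:
    return log[0] if log else ""


def toy_edit(tree: Tree, summary: str) -> Tree:
    return {**tree, summary: TARGET[summary]} if summary in TARGET else tree


start = {"pot_full": "fetch_onion", "soup_ready": "wait"}
final_tree, final_reward = refine(start, toy_evaluate, toy_summarize, toy_edit, rounds=3)
```

The sketch also makes the soft spot visible: if `summarize` names the wrong branch, `edit` changes the wrong thing, and only an explicit acceptance check stops the tree from getting worse.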

What works is the concrete framing for a cooperative game setting. The efficiency targets—fewer LLM calls and much lower latency—are the right ones for deployment, and the dual-tree structure makes the policy inspectable in a way black-box methods are not.

The soft spot is exactly the closed-loop refinement. The abstract states that natural language summaries identify and fix problematic branches, yet supplies no description of how the summaries are produced, how they map to tree edits, or any ablation that isolates this step from the initial distillation. The 35.4% reward lift, 77.7% query reduction, and 97.1% latency drop are reported without protocol, baselines, variance, or controls, so there is no way to tell whether the gains come from the tree structure itself or from the feedback loop. The stress-test concern lands: if the summarization step is noisy or targets the wrong branches, the headline numbers do not follow.

This is for people building human-AI systems in games or simple robotics who need both speed and some interpretability. A reader already working on Overcooked-style tasks could extract the high-level recipe and test it, but the current write-up does not yet support the performance claims.

I would send it for peer review only if the full manuscript contains the missing ablations and experimental details; otherwise the central engineering claim is too thinly supported to justify referee time.

Referee Report

3 major / 1 minor

Summary. The paper proposes Co-pi-tree, a closed-loop method that distills LLM reasoning into an executable policy tree consisting of a partner-behavior prediction tree and an agent-action selection tree for human-AI collaboration tasks. The approach constructs the tree via distillation, then uses interaction feedback summarized in natural language to identify and correct problematic branches. Experiments in Overcooked-AI are reported to yield a 35.4% improvement in average reward over baseline, along with 77.7% fewer LLM queries and 97.1% lower test-time latency.

Significance. If the quantitative claims hold under rigorous controls, the work would offer a meaningful contribution by producing interpretable policies that combine the reasoning capacity of LLMs with execution efficiency and feedback-driven refinement, addressing limitations of both black-box MARL and per-step LLM querying in collaborative settings.

major comments (3)

The abstract and method description provide no ablation isolating the closed-loop natural-language feedback correction step from the initial distillation; without this, the reported 35.4% reward gain, 77.7% query reduction, and 97.1% latency reduction cannot be attributed to the claimed refinement mechanism rather than the base tree alone.
No mechanism details, error analysis, or consistency checks are supplied for how natural-language summaries of interaction feedback map to specific branch edits in the policy tree, leaving open the possibility that the summarization step introduces new errors or fails to target the branches responsible for measured gains.
The experimental claims reference Overcooked-AI results but supply no protocol details, baseline definitions, statistical tests, variance measures, or controls in the provided text, rendering the quantitative improvements unverifiable as support for the central claims.
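The kind of reporting these comments ask for can be sketched in a few lines. The variant names `distilled_only` and `refined` and the helper functions are hypothetical, and the reward lists are placeholder numbers chosen to keep the arithmetic obvious; they are not results from the paper.

```python
from statistics import mean, stdev
from typing import Dict, List

Report = Dict[str, Dict[str, float]]


def summarize_ablation(rewards: Dict[str, List[float]]) -> Report:
    """Per-variant mean, sample standard deviation, and episode count."""
    report: Report = {}
    for variant, values in rewards.items():
        if len(values) < 2:
            raise ValueError(f"{variant}: need at least two episodes for a std")
        report[variant] = {
            "mean": mean(values),
            "std": stdev(values),
            "n": float(len(values)),
        }
    return report


def relative_gain(report: Report, baseline: str, treatment: str) -> float:
    """Percent change in mean reward of `treatment` over `baseline`."""
    base = report[baseline]["mean"]
    return 100.0 * (report[treatment]["mean"] - base) / base


# Placeholder episode rewards, NOT data from the paper.
placeholder = {
    "distilled_only": [10.0, 20.0, 30.0],
    "refined": [20.0, 30.0, 40.0],
}
report = summarize_ablation(placeholder)
gain = relative_gain(report, "distilled_only", "refined")
```

Reporting the spread next to each mean is what would let a reader judge whether a headline percentage is larger than the episode-to-episode noise.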

minor comments (1)

The project page link is mentioned but no supplementary material or code repository is referenced to support reproducibility of the policy tree construction and feedback loop.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and commit to revisions that strengthen the manuscript without misrepresenting the current content.


Referee: The abstract and method description provide no ablation isolating the closed-loop natural-language feedback correction step from the initial distillation; without this, the reported 35.4% reward gain, 77.7% query reduction, and 97.1% latency reduction cannot be attributed to the claimed refinement mechanism rather than the base tree alone.

Authors: We agree that the current manuscript lacks an explicit ablation isolating the closed-loop feedback correction. The reported gains are for the full Co-pi-tree system, but without this comparison the attribution remains unclear. In revision we will add an ablation study comparing the initial distilled tree to the feedback-refined version on the same Overcooked-AI tasks. revision: yes

Referee: No mechanism details, error analysis, or consistency checks are supplied for how natural-language summaries of interaction feedback map to specific branch edits in the policy tree, leaving open the possibility that the summarization step introduces new errors or fails to target the branches responsible for measured gains.

Authors: The manuscript describes the high-level closed-loop process but does not supply the requested low-level mechanism, error analysis, or consistency checks. We will expand the method section with concrete examples of how natural-language feedback summaries are parsed into branch edits, including any safeguards or post-edit validation steps used. revision: yes

Referee: The experimental claims reference Overcooked-AI results but supply no protocol details, baseline definitions, statistical tests, variance measures, or controls in the provided text, rendering the quantitative improvements unverifiable as support for the central claims.

Authors: The referee is correct that the text excerpt lacks these details. While the full manuscript contains an experimental section, we will revise to include explicit protocol descriptions, baseline definitions, statistical tests, variance reporting, and control conditions so that the quantitative results can be independently verified. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with no derivations or fitted predictions


The paper describes an engineering pipeline (distill LLM reasoning into executable policy tree code, interact to collect feedback, summarize feedback in natural language to edit branches) and reports experimental gains on Overcooked-AI. No equations, parameters fitted to subsets then re-predicted, self-definitional constructs, or load-bearing self-citations appear in the provided text. The central claims rest on measured interaction outcomes rather than any reduction of a result to its own inputs by construction. The method is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no equations, parameters, or modeling assumptions, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5732 in / 1189 out tokens · 26446 ms · 2026-06-27T18:49:35.315675+00:00 · methodology


Reference graph

Works this paper leans on

14 extracted references · 2 canonical work pages

[1]

In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1658–1677, Bangkok, Thailand

LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1658–1677, Bangkok, Thailand. Association for Computational Linguistics. Yang Li, Shao Zhang, Jichen Sun, Yali Du, Ying Wen, Xinbing Wa...

work page arXiv 2023
[2]

In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 472–491, Suzhou, China

SLOT: Structuring the output of large language models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 472–491, Suzhou, China. Association for Computational Linguistics. Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language agents wit...

2025
[3]

In Advances in Neural Information Processing Systems

Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems. Yang Xiao, Jiashuo Wang, Ruifeng Yuan, Chunpu Xu, Kaishuai Xu, Wenjie Li, and Pengfei Liu. 2025. LIMOPro: Reasoning refinement for efficient and effective test-time scaling. In Advances in Neural Information Processing Systems, volume...

work page arXiv 2025
[4]

In Advances in Neural Information Processing Systems

Self-discover: Large language models self-compose reasoning structures. In Advances in Neural Information Processing Systems. A Additional Details. A.1 Layout Descriptions. We evaluate Co-π-tree on five standard Overcooked-AI layouts, which cover different coordination bottlenecks. Cramped Room is a compact kitchen with one pot and one serving locatio...

2026
[5]

The inferred current action of <TEAMMATE_PLAYER>, and
[6]

<one legal action>

Exactly one legal action for <SELF_PLAYER> now. Required output format: ### FunctionDescription: Name: PredictTeammateThenPlan Inputs: - current_scene: - holdings: {self: empty/onion/dish/soup, teammate: empty/onion/dish/soup} - pots: for each <Pot>: onion_count: {0,1,2,3}, state: {idle,cooking,ready}, timers - derived flags: any_soup_ready, any_pot_not_f...
[7]

Causes clear throughput loss across multiple scenes, not just one isolated mistake,
[8]

Appears as repeated wasted timesteps, redundant role overlap, premature dish handling, delayed role switching, or legal-but-low-value actions,
[9]

PRIORITIZATION RULE: - Prefer a bottleneck that repeatedly reduces soup throughput, even if all actions are technically legal

Can still be attributed to ONE specific branch, priority rule, or missing condition in the decision tree. PRIORITIZATION RULE: - Prefer a bottleneck that repeatedly reduces soup throughput, even if all actions are technically legal. - Do NOT over-prioritize teammate prediction mistakes unless they clearly create repeated downstream time loss. - A valid-bu...
[10]

Scan all episode states and identify time-wasting behaviors
[11]

Select exactly ONE primary inefficiency with the largest negative impact on efficiency or score
[12]

Identify the EXACT target branch or decision condition to modify
[13]

Generate a Tree_Reflexion that guides a planner to perform a SINGLE, LOCAL modification: - describe the CURRENT behavior of this branch (BEFORE), - describe the DESIRED behavior after modification (AFTER), using refined conditions, priority changes, or a single added guard condition, - explicitly list assumptions about what logic MUST remain unchanged
[14]

Decision_Tree_Summary

The value in Decision_Tree_Summary.Final_Score MUST be exactly <S_k>. OUTPUT REQUIREMENTS: - Output MUST be valid JSON. - Do NOT include markdown, explanations, or extra text. - Be concrete and implementable. - Output must directly contain only these top-level keys: - "Decision_Tree_Summary" - "Tree_Reflexion" Return JSON in the following format: { "Decis...
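The output constraints quoted in this prompt fragment (valid JSON, exactly two top-level keys, and a `Final_Score` equal to the supplied score) are mechanical enough to check in code. The sketch below is one way to do that, under stated assumptions: the function name `validate_reflexion` is ours, `expected_score` stands in for `<S_k>`, and `Decision_Tree_Summary` is assumed to be a JSON object. The inner fields are truncated in the extracted text, so nothing beyond `Final_Score` is checked.

```python
import json

# The only two top-level keys the prompt allows.
REQUIRED_KEYS = {"Decision_Tree_Summary", "Tree_Reflexion"}


def validate_reflexion(raw: str, expected_score: float) -> dict:
    """Parse a model response and enforce the constraints quoted above."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) if invalid
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        raise ValueError(
            "top-level keys must be exactly: " + ", ".join(sorted(REQUIRED_KEYS))
        )
    summary = data["Decision_Tree_Summary"]
    if not isinstance(summary, dict) or summary.get("Final_Score") != expected_score:
        raise ValueError("Decision_Tree_Summary.Final_Score must equal the given score")
    return data


# Hypothetical response; real outputs carry more fields than shown here.
example = '{"Decision_Tree_Summary": {"Final_Score": 40}, "Tree_Reflexion": {}}'
checked = validate_reflexion(example, 40)
```

A check like this catches formatting failures only. It says nothing about whether the reflexion targets the branch that actually caused the loss, which is the concern raised in the referee report.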
