How Should Agents Read Demonstrations? Hierarchical Structure Beats Flat Action Logs

Henry Lieberman; Honjar Xing; Jefferson Lin

arxiv: 2606.20978 · v1 · pith:B3LUXQXWnew · submitted 2026-06-18 · 💻 cs.AI · cs.HC

Pith reviewed 2026-06-26 16:48 UTC · model grok-4.3

keywords: programming by demonstration, LLM agents, hierarchical demonstrations, demonstration structure, web automation, task ambiguity, procedural knowledge

The pith

Hierarchically grouped demonstrations improve LLM agent success rates from 76.7% to 90.7% on tasks with vague descriptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how to organize recorded action sequences before feeding them to LLM agents in programming-by-demonstration settings. It compares four formats that share the same underlying actions but differ in grouping and labeling. On 43 web tasks whose natural-language instructions leave procedural details open, presenting actions as labeled hierarchical subgoals raises pass rates to 90.7 percent while flat logs produce only a small, non-significant gain. The benefit vanishes on 42 tasks whose instructions already specify the steps precisely, indicating that the structure helps mainly when ambiguity must be resolved from the demonstration itself. Ablations confirm that the grouping into subgoals, rather than added annotations, accounts for the measured improvement.

Core claim

Across 85 web automation tasks the authors compare a zero-shot baseline to four demonstration formats that contain identical action sequences. On the 43 tasks whose descriptions are vague, hierarchically grouped demonstrations raise pass rates from 76.7 percent to 90.7 percent (paired permutation test p=0.034). Flat demonstrations produce a smaller, non-significant lift. On the 42 tasks with precise descriptions none of the formats improves performance. An ablation isolating subgoal grouping shows that preconditions, postconditions, and parameter annotations add no further measurable benefit.
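
The percentages above can be cross-checked against the task count. The following is a minimal sketch that recovers the pass counts implied by 76.7% and 90.7% on 43 tasks; the counts themselves are inferred here and are not stated in the source, and the helper name `passes_from_rate` is hypothetical.

```python
# Sanity check: which pass counts out of 43 round to the reported percentages?
# The counts are inferred from the reported rates, not taken from the paper.
N_VAGUE = 43


def passes_from_rate(rate_percent, n):
    """Return the single pass count k for which 100*k/n rounds to rate_percent."""
    matches = [k for k in range(n + 1) if round(100 * k / n, 1) == rate_percent]
    if len(matches) != 1:
        raise ValueError(f"{rate_percent}% of {n} is ambiguous or impossible: {matches}")
    return matches[0]


baseline_passes = passes_from_rate(76.7, N_VAGUE)
hier_passes = passes_from_rate(90.7, N_VAGUE)
net_gain = hier_passes - baseline_passes

print(baseline_passes, hier_passes, net_gain)  # 33 39 6
```

Under this reading, the net gain of six tasks matches the 6:0 win-loss record reported in the Figure 1 caption and the abstract: six tasks flipped from fail to pass and none flipped the other way.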

What carries the argument

Hierarchical grouping of recorded actions into labeled subgoals, which segments the flat action log to supply explicit procedural structure to the LLM agent.
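
To make the contrast concrete, here is a minimal sketch of how one recorded action sequence might be rendered either as a flat step list or as labeled subgoal groups. The `Action` and `Subgoal` classes, the rendering functions, and the shopping example are all hypothetical illustrations; the source does not show the paper's actual prompt formats.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str          # hypothetical action type, e.g. "click" or "type"
    target: str        # hypothetical description of the UI element
    value: str = ""    # optional typed text


@dataclass
class Subgoal:
    label: str
    actions: list = field(default_factory=list)


def describe(action):
    """Render one action as a short line of text."""
    text = f"{action.kind} {action.target}"
    return f'{text} "{action.value}"' if action.value else text


def render_flat(actions):
    """Flat log: a numbered list of steps with no grouping."""
    return "\n".join(f"{i}. {describe(a)}" for i, a in enumerate(actions, 1))


def render_hierarchical(subgoals):
    """Grouped log: the same numbered steps, nested under labeled subgoals."""
    lines, step = [], 1
    for index, goal in enumerate(subgoals, 1):
        lines.append(f"Subgoal {index}: {goal.label}")
        for action in goal.actions:
            lines.append(f"  {step}. {describe(action)}")
            step += 1
    return "\n".join(lines)


demo = [
    Subgoal("Find the product", [Action("type", "search box", "desk lamp"),
                                 Action("click", "Search button")]),
    Subgoal("Add it to the cart", [Action("click", "first result"),
                                   Action("click", "Add to cart button")]),
]
flat_actions = [a for goal in demo for a in goal.actions]
flat_text = render_flat(flat_actions)
hier_text = render_hierarchical(demo)

print(flat_text)
print()
print(hier_text)
```

Both renderings carry the same four steps in the same order; only the subgoal headers differ, which is the kind of difference the paper's comparison is described as isolating.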

If this is right

PbD systems should segment recorded action sequences into named subgoal groups rather than deliver flat step lists.
The organizational advantage appears only when task descriptions leave procedural details ambiguous.
Subgoal grouping by itself accounts for the observed gain; additional annotations such as preconditions provide no extra benefit.
Any system supplying procedural context to an LLM agent can adopt the same segmentation practice.
On tasks whose instructions already specify the full procedure, demonstrations add no value regardless of format.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same grouping principle could be tested on non-web agent domains such as robotic manipulation or code generation.
Agents might be trained to infer subgoal boundaries automatically from flat logs when explicit labels are unavailable.
Varying the depth or granularity of the hierarchy offers a testable way to measure how much structure is optimal for different task classes.
Combining hierarchical demonstrations with other context-compression techniques could be examined to determine whether the two approaches compound.

Load-bearing premise

The only systematic difference among the four demonstration formats is their organizational structure; action sequences, the underlying LLM, and evaluation criteria remain identical across all 85 tasks.
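
One part of this premise is mechanically checkable: flattening a grouped demonstration should reproduce the flat log exactly. The sketch below shows such a check under the assumption that a grouped demonstration is a list of `(label, actions)` pairs; that representation and the function names are hypothetical, not taken from the paper.

```python
def flatten(grouped):
    """Drop subgoal labels and concatenate the actions in order."""
    return [action for _label, actions in grouped for action in actions]


def same_action_sequence(flat_log, grouped):
    """True if the grouped demonstration contains exactly the flat log's steps, in order."""
    return flatten(grouped) == list(flat_log)


# Hypothetical example data.
flat_log = ["type search box", "click Search", "click first result", "click Add to cart"]
grouped = [("Find the product", flat_log[:2]), ("Add it to the cart", flat_log[2:])]
reordered = [("Add it to the cart", flat_log[2:]), ("Find the product", flat_log[:2])]

print(same_action_sequence(flat_log, grouped))    # True
print(same_action_sequence(flat_log, reordered))  # False
```

A check of this kind covers only the action sequences; it says nothing about whether the underlying model and the evaluation criteria were also held fixed.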

What would settle it

A controlled replication on a fresh set of vague-description tasks in which hierarchical grouping produces no higher pass rate than flat logs or in which flat logs outperform the grouped format.

Figures

Figures reproduced from arXiv: 2606.20978 by Henry Lieberman, Honjar Xing, Jefferson Lin.

**Figure 1.** Pass rates on 43 natural-language tasks (3 reps, majority vote). Structured demonstrations (C3, C4) improve pass rates from 76.7% to 90.7% (paired permutation test p=0.034; McNemar p=0.031; win-loss 6:0). Error bars show approximate 95% confidence intervals.
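
The statistics in this caption can be partly reproduced from the win-loss record alone. The sketch below shows a majority vote over repetitions, a two-sided exact McNemar test on the discordant pairs, and a Wilson score interval. The Wilson method is an assumption, since the caption says only "approximate" intervals, and the pass count of 39 out of 43 is inferred from the reported 90.7%.

```python
from math import comb, sqrt


def majority_vote(reps):
    """Task-level outcome: pass if more than half of the repetitions passed."""
    return 2 * sum(reps) > len(reps)


def mcnemar_exact(wins, losses):
    """Two-sided exact McNemar p-value from the discordant pair counts."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate (assumed method, z=1.96 for ~95%)."""
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


print(majority_vote([True, False, True]))   # True
p_value = mcnemar_exact(wins=6, losses=0)
print(round(p_value, 3))                    # 0.031
low, high = wilson_interval(39, 43)         # 39/43 inferred from 90.7%
print(low < 39 / 43 < high)                 # True
```

With six wins and no losses the exact McNemar p-value is 2/64 = 0.03125, which agrees with the McNemar p=0.031 quoted in the caption.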

Original abstract

Programming by Demonstration (PbD) offers a human-centered way to author procedural knowledge for LLM agents: users communicate what they want by showing rather than by writing prompts or code, making agent authoring accessible to non-programmers. The natural output of a PbD recording is a flat action log, but how this log is organized before being passed to the agent is an open design question with significant consequences for plan quality. We propose grouping recorded actions into labeled, hierarchical subgoals and evaluate the effect of this organizational structure in a controlled experiment. Across 85 web automation tasks, we compare a zero-shot baseline against four demonstration formats that share identical action sequences but differ in structure. On 43 natural-language tasks with vague descriptions, hierarchically grouped demonstrations improve pass rates from 76.7% to 90.7% (paired permutation test p=0.034; win-loss 6:0), while flat demonstrations show a smaller, non-significant improvement. On 42 tasks with precise descriptions, no format provides any benefit, confirming that the hierarchical advantage arises specifically when descriptions leave procedural details ambiguous. Ablation shows that subgoal grouping alone drives the effect: preconditions, postconditions, and parameter annotations add no measurable benefit. These results offer a concrete design recommendation for PbD pipelines and, more broadly, for any system that feeds procedural context to an LLM agent: segment action sequences into named subgoal groups rather than presenting flat step lists.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Hierarchical grouping of demonstrations lifts pass rates on vague tasks from 76.7% to 90.7% in a controlled comparison, with the effect isolated to subgoal grouping.

The main finding is that organizing the same action sequences into labeled hierarchical subgoals improves LLM agent success on natural-language tasks with vague descriptions, while flat logs help less and precise descriptions need no extra structure at all.

The paper runs a clean comparison across 85 web automation tasks. Four demonstration formats share identical actions and differ only in how they are organized. The authors include a zero-shot baseline, report pass rates, and use a paired permutation test. The ablation is useful: it shows that subgoal grouping alone drives the gain, while preconditions, postconditions, and parameter annotations add nothing measurable. The split into 43 vague and 42 precise tasks produces a differential result that fits the claim that hierarchy supplies missing procedural detail when the description is ambiguous.

This is a straightforward empirical contribution in the PbD-for-LLM-agents setting. The design keeps the underlying data fixed, which makes the attribution to structure credible. The numbers are reported with enough detail to be checked.

The domain is narrow: all tasks are web automation, so it is unclear whether the same pattern would appear in code, robotics, or other procedural settings. The key comparison rests on 43 tasks and a 6:0 win-loss record, which is modest in scale even though the p-value reaches 0.034. Task selection and scoring criteria would need scrutiny in review.

The work is aimed at researchers who build systems that turn human demonstrations into context for agents. It gives a simple, low-cost design rule that is directly testable.

The experiment is controlled enough to merit peer review. Referees can verify the task set and evaluation, but the core comparison and ablation are worth the time.

Referee Report

0 major / 3 minor

Summary. The paper claims that, for LLM-based web automation agents, demonstrations grouped into labeled hierarchical subgoals outperform both flat action logs and a zero-shot baseline, specifically on 43 tasks with vague natural-language descriptions (pass rate rising from 76.7% to 90.7%, paired permutation test p=0.034, win-loss 6:0). Flat formats yield only a smaller, non-significant gain. On 42 tasks with precise descriptions, no format helps. An ablation isolates subgoal grouping as the sole driver, with preconditions, postconditions, and parameter annotations adding no benefit. All formats share identical action sequences and differ only in organizational structure.

Significance. If the controlled comparison and ablation hold, the result supplies a concrete, low-cost design rule for programming-by-demonstration pipelines: segment action logs into named subgoal groups rather than flat lists when task descriptions leave procedural details ambiguous. The differential effect across vague versus precise descriptions and the explicit isolation of grouping strengthen the practical takeaway for any system supplying procedural context to LLM agents.

minor comments (3)

[Abstract, §4] The 43/42 task split and the criteria used to classify descriptions as “vague” versus “precise” should be stated explicitly so readers can assess whether the differential result is robust to alternative partitions.
[§5] The manuscript reports a paired permutation test (p=0.034) and a win-loss count but does not indicate whether the test was pre-registered or whether any correction for multiple comparisons across the four formats was applied; a brief statement would increase transparency.
[§5] A table or figure presenting per-task outcomes for the four formats would allow direct inspection of the 6:0 win-loss pattern and the magnitude of the flat-format improvement.
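
To illustrate what per-task outcomes would allow a reader to check, the following is a minimal sketch of a win-loss count and an exact paired sign-flip permutation test. The outcome vectors are synthetic, built only to match the reported aggregates (33 and 39 passes out of 43, inferred from the percentages, with a 6:0 win-loss record); they are not the paper's data, and the paper's exact test variant is not specified in the source.

```python
from itertools import product


def win_loss(baseline, treatment):
    """Count tasks that flipped fail->pass (wins) and pass->fail (losses)."""
    wins = sum(1 for b, t in zip(baseline, treatment) if t and not b)
    losses = sum(1 for b, t in zip(baseline, treatment) if b and not t)
    return wins, losses


def paired_permutation_p(baseline, treatment):
    """Exact two-sided sign-flip test on paired pass/fail outcomes.

    Ties contribute zero under every sign flip, so only discordant pairs
    are enumerated. Enumeration is exponential in the number of discordant
    pairs, which is fine for a handful but not for large counts.
    """
    diffs = [int(t) - int(b) for b, t in zip(baseline, treatment) if b != t]
    observed = abs(sum(diffs))
    total = extreme = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            extreme += 1
    return extreme / total


# Synthetic per-task outcomes matching the reported aggregates only.
baseline = [True] * 33 + [False] * 10
hierarchical = [True] * 33 + [True] * 6 + [False] * 4

print(win_loss(baseline, hierarchical))              # (6, 0)
print(paired_permutation_p(baseline, hierarchical))  # 0.03125
```

This exact variant gives 2/64, about 0.031, the same value as the McNemar figure in the Figure 1 caption; the reported 0.034 may come from a different variant of the test. On the multiple-comparisons point, a Bonferroni correction across four formats would set the threshold at 0.05/4 = 0.0125, so whether a correction was applied matters for how the reported p-value should be read.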

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the accurate summary of our results, the positive significance assessment, and the recommendation of minor revision. No major comments were provided in the report.

Circularity Check

0 steps flagged

No circularity: purely empirical comparison of demonstration formats

The paper reports a controlled experiment comparing four demonstration formats that share identical action sequences and differ only in organizational structure. Pass rates are measured directly on 85 tasks with statistical tests (paired permutation test). No equations, fitted parameters, predictions derived from inputs, self-citations as load-bearing premises, or ansatzes are present. The central claim (hierarchical grouping improves performance on vague tasks) is supported by the ablation and differential results across task types, with no reduction of outputs to inputs by construction. This is a standard empirical study with no derivation chain to inspect for circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; the claim rests on the experimental design isolating structure as the causal factor and on pass rate being a faithful measure of plan quality.

axioms (1)

Domain assumption: Pass rate on the chosen web automation tasks is a valid and unbiased proxy for plan quality produced by the LLM agent.
The paper uses pass rates as the primary outcome to compare demonstration formats.
