Why Solve It Twice? Hierarchical Accumulation of Skills for Transfer-Efficient ML Engineering

Osmar R. Zaiane; Yashar Talebirad; Yongbin Kim

arxiv: 2606.30911 · v2 · pith:KZ5TAGFTnew · submitted 2026-06-29 · 💻 cs.AI · cs.LG · cs.MA

Yongbin Kim, Yashar Talebirad, Osmar R. Zaiane

Pith reviewed 2026-07-02 20:12 UTC · model grok-4.3

classification 💻 cs.AI · cs.LG · cs.MA

keywords hierarchical multi-agent systems · skill accumulation · transfer learning · ML engineering agents · Kaggle competitions · knowledge organization · abstraction mechanisms

The pith

Organizing ML engineering skills into global, domain, and competition-specific tiers lets agents transfer knowledge across tasks instead of rediscovering it each time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a hierarchical multi-agent system that maintains a growing inventory of skills partitioned by scope so later competitions begin with relevant knowledge already loaded. An orchestrator uses LLM-driven abstraction to promote patterns from specific work into reusable higher-tier skills. Controlled tests holding a 159-skill set fixed across eight competitions show tiered loading reaches a 100 percent medal rate while flat loading of the identical skills yields only 62.5 percent, the same rate obtained with an empty inventory, and consumes twice the output tokens. On the full 22-competition benchmark the system attains a 77.3 percent medal rate, warm starts cut refinement iterations by 52 percent, and the fraction of proposed changes kept by the agent rises from 42 percent to 85 percent once fifty or more skills are available.

Core claim

HASTE maintains a 159-skill inventory partitioned into three tiers and uses an orchestrator to abstract and move knowledge upward. When the same inventory is reloaded in tiered form across eight competitions the medal rate reaches 100 percent; flat reloading of the identical skills yields only 62.5 percent, identical to starting with an empty inventory. On the full twenty-two-competition benchmark the system attains 77.3 percent medals. Warm-start runs require 52 percent fewer iterations and retain 85 percent of proposed changes once fifty or more skills are present.

What carries the argument

The three-tier hierarchy (global, domain, competition-specific) coupled to an LLM orchestrator that performs abstraction and promotion between tiers.
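The difference between tiered and flat loading can be sketched in a few lines. This is a minimal illustration, not the paper's loading function: the `Skill` record, the `load_tiered` and `load_flat` helpers, and the example skill texts are all hypothetical, and the only detail taken from the paper is the three scope tiers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Skill:
    text: str
    tier: str                          # "global", "domain", or "competition"
    domain: Optional[str] = None       # set for domain- and competition-tier skills
    competition: Optional[str] = None  # set only for competition-tier skills


def load_flat(inventory):
    """Flat loading: every skill is supplied, regardless of scope."""
    return list(inventory)


def load_tiered(inventory, domain, competition):
    """Tiered loading: keep only skills whose scope matches the current task."""
    selected = []
    for s in inventory:
        if s.tier == "global":
            selected.append(s)
        elif s.tier == "domain" and s.domain == domain:
            selected.append(s)
        elif s.tier == "competition" and s.competition == competition:
            selected.append(s)
    return selected


# Hypothetical inventory, invented for illustration only.
inventory = [
    Skill("Check the validation split before tuning.", "global"),
    Skill("Try pretrained encoders first.", "domain", domain="vision"),
    Skill("Use subword tokenization.", "domain", domain="nlp"),
    Skill("Column 7 leaks the target.", "competition",
          domain="tabular", competition="comp-a"),
]

flat = load_flat(inventory)
tiered = load_tiered(inventory, domain="vision", competition="comp-b")
print(len(flat), [s.tier for s in tiered])
```

In this toy case a vision task receives the global skill and the vision-domain skill, while flat loading also supplies the NLP skill and another competition's dataset quirk.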

If this is right

Tiered loading of a fixed skill inventory raises the medal rate from 62.5 percent to 100 percent, a 1.6-fold increase over both flat loading and no skills.
Warm-start competitions require 52 percent fewer refinement iterations than cold starts.
The fraction of agent-proposed changes that are kept rises from 42 percent to 85 percent as the skill inventory grows past fifty items.
Knowledge organization can reduce the need for stronger base models or larger compute budgets.
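The headline percentages can be reproduced from the counts reported on this page. The 8-of-8 and 5-of-8 counts come from the Figure 2 caption; the 17-of-22 count is an inference from the stated 77.3 percent and is not given explicitly here.

```python
def medal_rate(medals, competitions):
    """Medal rate as a percentage."""
    return 100.0 * medals / competitions


tiered = medal_rate(8, 8)   # tiered loading: medals on all 8 competitions
flat = medal_rate(5, 8)     # flat and empty loading: medals on 5 of 8
ratio = tiered / flat       # relative improvement of tiered over flat

# Assumption: 17 medals is the only integer count out of 22 that rounds to 77.3%.
full_benchmark = round(medal_rate(17, 22), 1)

print(tiered, flat, ratio, full_benchmark)
```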

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tiering principle could reduce repeated discovery in other sequential agent tasks such as iterative software development.
If the abstraction step scales without introducing errors, larger inventories become feasible without proportional token costs.
Comparing cold-start and warm-start curves on new benchmarks would quantify how much hierarchy substitutes for raw model capability.

Load-bearing premise

The skills admit a stable partition into the three scope tiers and the orchestrator abstractions remain accurate and non-misleading when reloaded in new competitions.

What would settle it

Running the identical 159-skill inventory through the eight competitions once with tiered loading and once with flat loading, and finding that the tiered condition no longer produces a 100 percent medal rate or that abstractions degrade later performance.

Figures

Figures reproduced from arXiv: 2606.30911 by Osmar R. Zaiane, Yashar Talebirad, Yongbin Kim.

**Figure 1.** HASTE architecture. The Orchestrator assigns competitions to domain Specialists. Each Specialist loads relevant skills, executes the pipeline (profile → prototype → refine → ensemble), and produces learnings. Between rounds, the Orchestrator promotes generalizable learnings upward through the hierarchy via LLM-driven abstraction.

**Figure 2.** Controlled ablation in Section 4.5. All conditions use the same 159-skill inventory, model, pipeline, and budget. Tiered loading medals on all 8 competitions; flat and empty both medal on 5 of 8.

**Figure 3.** Abbreviated prototype screen prompt. The full prompt injects measured GPU/CPU/RAM from a resource probe and up to 2000 characters of accumulated skills. The LLM returns three model specifications, each executed with 1-fold validation on full data.

**Figure 4.** Abbreviated refinement proposal prompt. The tier label rotates through Exploring, Optimizing, and Fine-tuning. The six-mode failure taxonomy gives the LLM structured diagnostic guidance. The decision field allows the LLM to self-escalate tiers or terminate early.
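A rough sketch of how a refinement loop might act on the decision field follows. The tier names and the three decision values are copied from the prompt excerpt as extracted on this page (underscores may have been lost in extraction); the `next_state` helper and the rule that escalating past the last tier ends refinement are assumptions, not the paper's stated behavior.

```python
import json

TIERS = ["Exploring", "Optimizing", "Fine-tuning"]
DECISIONS = {"CONTINUE", "NEXT TIER", "STOP"}


def next_state(tier, reply_text):
    """Return (next_tier, done) given the current tier and a JSON reply."""
    reply = json.loads(reply_text)
    decision = reply["decision"]
    if decision not in DECISIONS:
        raise ValueError(f"unexpected decision: {decision!r}")
    if decision == "STOP":
        return tier, True          # early termination
    if decision == "NEXT TIER":
        idx = TIERS.index(tier)
        if idx == len(TIERS) - 1:
            return tier, True      # assumption: nothing left to escalate to
        return TIERS[idx + 1], False
    return tier, False             # CONTINUE: stay in the current tier


reply = '{"plan": "x", "change specification": "y", "decision": "NEXT TIER"}'
print(next_state("Exploring", reply))
```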

**Figure 5.** The learning production prompt. Each specialist reflects on its full experiment history after completing a competition. Learnings are saved to the competition tier and later evaluated for promotion by the orchestrator.

**Figure 6.** The skill promotion prompt. The orchestrator evaluates all new learnings against existing skills after each round. Abstractions strip competition-specific details to produce reusable domain or global knowledge.
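The promotion prompt's quality standards, as excerpted on this page (five labels, at most 50% of learnings promoted, no competition names in abstractions), could be checked mechanically after the LLM replies. The sketch below is one hypothetical way to do that: `check_promotions`, the dictionary layout, and the example batch are assumptions, and the paper does not say that such a check exists.

```python
LABELS = {"skip", "competition", "domain", "global", "conflict"}
PROMOTED = {"domain", "global"}  # labels that move a learning up a tier


def check_promotions(decisions, competition_names, max_fraction=0.5):
    """Return a list of problems in one batch; an empty list means it passes."""
    problems = []
    for i, d in enumerate(decisions):
        if d["label"] not in LABELS:
            problems.append(f"item {i}: unknown label {d['label']!r}")

    promoted = [d for d in decisions if d["label"] in PROMOTED]
    if len(promoted) > max_fraction * len(decisions):
        problems.append(f"promoted {len(promoted)} of {len(decisions)} learnings")

    for d in promoted:
        text = d.get("abstraction", "").lower()
        for name in competition_names:
            if name.lower() in text:
                problems.append(f"abstraction mentions competition name {name!r}")
    return problems


# The first abstraction is the prompt's own "Bad" example; the rest is made up.
batch = [
    {"label": "global", "abstraction": "On aerial-cactus, AUC reached 0.9997"},
    {"label": "skip"},
    {"label": "competition"},
    {"label": "skip"},
]
issues = check_promotions(batch, competition_names=["aerial-cactus"])
print(issues)
```

A check like this covers only the rules that can be tested with string matching; whether an abstraction stays accurate on reload still needs human or held-out evaluation.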

Original abstract

ML engineering agents waste compute rediscovering known techniques because every competition is a cold start. We present HASTE, a hierarchical multi-agent system that organizes cross-competition knowledge into three scope tiers (global, domain, and competition-specific), each coupled to a matching agent level. An orchestrator coordinates domain specialists and promotes learning between tiers via LLM-driven abstraction. A controlled ablation provides evidence for scoped loading: holding a 159-skill inventory constant across 8 competitions, tiered loading achieves a 100% medal rate while flat loading reaches only 62.5%, the same medal rate as loading no skills, and consumes 2x the output tokens. On the full MLE-Bench Lite benchmark (22 Kaggle competitions), HASTE reaches a medal rate of 77.3% using Claude Sonnet 4.6 at 12h per competition; this is a single-seed campaign result, and multi-seed replication is the priority follow-up. In a cold-start run, the system begins with no accumulated skills. In warm-start runs, it reloads skills learned from earlier competitions, using only global and domain-level skills for transfer across competitions. Warm starts use 52% fewer refinement iterations, and the fraction of proposed changes kept by the agent rises from 42% at low inventory to 85% once 50+ skills are available. These results suggest that better knowledge organization can partly substitute for model strength and compute budget in ML-engineering agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HASTE shows a solid ablation win for tiered skill loading over flat on fixed inventory, but the tier assignment and abstraction steps lack described validation.

The main takeaway is that tiered loading of a fixed 159-skill inventory across 8 competitions reaches 100% medal rate while flat loading hits only 62.5%, matching the no-skill baseline, and the warm-start runs cut iterations by 52% with rising change acceptance as inventory grows.

What is new is the three-tier structure (global, domain, competition-specific) matched to agent levels, plus the orchestrator that abstracts and promotes skills between tiers via LLM. The controlled ablation on medal rates and token use, plus the MLE-Bench Lite result of 77.3%, gives a concrete data point for the subfield.

The paper does well by holding the skill inventory constant in the ablation and reporting the efficiency difference in output tokens. The transfer metrics in warm starts are a useful practical signal.

The main soft spot is the single-seed design with no error bars or statistical tests, which weakens confidence in the size of the gap. The stress-test concern also lands: the description does not supply decision rules, prompts, or checks for how skills are partitioned into tiers or whether the abstractions stay accurate and non-misleading on reload. Without that, the advantage could partly reflect curation differences rather than the hierarchy itself. The full methods would need to address reproducibility here.

This is aimed at researchers building LLM agents for ML engineering and data science automation. Readers tracking knowledge reuse and transfer in agents will find the numbers worth examining. It has enough of an experimental comparison on a public benchmark to deserve a serious referee, though it will need more runs and methods detail.

I would send it to peer review with targeted requests for the tiering protocol and multi-seed results.

Referee Report

2 major / 2 minor

Summary. The paper introduces HASTE, a hierarchical multi-agent system for ML engineering that partitions accumulated skills into three scope tiers (global, domain, competition-specific) managed by matching agent levels, with an orchestrator performing LLM-driven abstraction to transfer knowledge across competitions. Holding a fixed 159-skill inventory across 8 competitions, the ablation reports tiered loading reaching 100% medal rate while flat loading reaches only 62.5% (identical to no skills) and uses twice the output tokens. On the full MLE-Bench Lite (22 competitions) the system attains 77.3% medal rate in a single-seed run; warm-start runs reduce refinement iterations by 52% and raise the fraction of kept changes from 42% to 85% once 50+ skills are available.

Significance. If the tiered organization and abstraction mechanism prove robust under replication, the work demonstrates that explicit knowledge scoping can partly substitute for additional model scale or compute in ML agents. The fixed-inventory ablation is a methodological strength because it isolates the contribution of hierarchical loading from skill-set growth. The manuscript correctly flags multi-seed replication as the immediate next step.

major comments (2)

[Abstract] Abstract (tiered loading paragraph): the 100% vs 62.5% medal-rate gap is presented as evidence for scoped loading, yet the manuscript supplies no decision rules, prompts, or validation protocol for partitioning the 159 skills into the three tiers or for confirming that LLM abstractions remain accurate and non-misleading when reloaded; without these the result could reflect implicit per-competition curation absent from the flat baseline.
[Abstract] Abstract (ablation description): the quantitative claim rests on a single-seed experiment with no error bars or statistical tests reported, even though the text itself identifies multi-seed replication as the priority follow-up; this leaves the magnitude of the tiered-loading advantage vulnerable to run-to-run variance.
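To give a sense of scale for the variance concern, the reported counts (8 of 8 versus 5 of 8) can be put through two simple exact tests. This is an illustrative calculation added here, not an analysis from the paper. It treats each competition as a single binary outcome, and the paired version assumes the 3 competitions where flat loading missed a medal are the only discordant pairs, which follows from tiered loading medaling on all 8.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, row1)

    def prob(x):
        # Hypergeometric probability of x successes landing in row 1.
        return comb(col1, x) * comb(n - col1, row1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def sign_test_two_sided(a_only, b_only):
    """Exact two-sided sign test on discordant pairs (McNemar-style)."""
    n, k = a_only + b_only, min(a_only, b_only)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Rows: tiered (8 medals, 0 misses) and flat (5 medals, 3 misses).
p_unpaired = fisher_exact_two_sided(8, 0, 5, 3)
# Paired view: 3 competitions medal only under tiered, 0 only under flat.
p_paired = sign_test_two_sided(3, 0)
print(round(p_unpaired, 3), round(p_paired, 3))
```

Under these assumptions both p-values sit well above 0.05, which is consistent with the referee's point that eight single-seed competitions cannot by themselves pin down the size of the advantage.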

minor comments (2)

[Abstract] The abstract states results for 'Claude Sonnet 4.6'; confirm the exact model identifier and version used in the experiments.
[Abstract] The manuscript notes that warm-start runs reload only global and domain-level skills; clarify whether competition-specific skills are ever carried forward or are always reset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the fixed-inventory ablation as a methodological strength. We address each major comment below, indicating where revisions will be made to improve clarity and reproducibility.

Referee: [Abstract] Abstract (tiered loading paragraph): the 100% vs 62.5% medal-rate gap is presented as evidence for scoped loading, yet the manuscript supplies no decision rules, prompts, or validation protocol for partitioning the 159 skills into the three tiers or for confirming that LLM abstractions remain accurate and non-misleading when reloaded; without these the result could reflect implicit per-competition curation absent from the flat baseline.

Authors: We agree that the manuscript does not supply the requested decision rules, prompts, or validation protocol for tier partitioning and abstraction accuracy. This is a valid concern that could affect interpretation of whether the tiered advantage is due to hierarchy or unstated curation. In the revised version we will add a new subsection detailing: (1) explicit criteria for assigning skills to global/domain/competition-specific tiers, (2) the LLM prompt templates used for abstraction when reloading skills, and (3) a validation protocol (manual review of a sample of abstractions for accuracy). Example partitions from the 159-skill set will also be included. These additions will be placed in the methods section to support reproducibility.

Revision: yes

Referee: [Abstract] Abstract (ablation description): the quantitative claim rests on a single-seed experiment with no error bars or statistical tests reported, even though the text itself identifies multi-seed replication as the priority follow-up; this leaves the magnitude of the tiered-loading advantage vulnerable to run-to-run variance.

Authors: The manuscript already states that both the 77.3% result and the ablation are single-seed and flags multi-seed replication as the priority follow-up. We acknowledge that the absence of error bars or statistical tests leaves the 100% vs 62.5% gap open to variance concerns. Because the experiments are computationally expensive we cannot add new seeds for this revision. We will, however, revise the abstract and results to more prominently state the single-seed limitation and note the lack of variance estimates. This improves presentation while preserving the existing claims.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; results are direct experimental outcomes

The paper's central claims rest on controlled ablation experiments that hold a fixed 159-skill inventory constant and directly measure medal rates (100% tiered vs 62.5% flat) and iteration counts across competitions. These are empirical measurements rather than quantities derived from fitted parameters, self-referential equations, or load-bearing self-citations. No equations are presented that reduce a prediction to its own inputs by construction, and the skill-tier partitioning and abstraction steps are described as experimental conditions without being justified via prior self-citation chains or uniqueness theorems. The derivation chain is therefore self-contained against the reported benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on the domain assumption that an LLM can reliably abstract concrete competition solutions into reusable tiered skills without introducing systematic errors that would invalidate later transfer; no free parameters or invented physical entities are introduced.

axioms (1)

domain assumption: LLM-driven abstraction produces skills that remain useful when reloaded across competitions.
Invoked in the description of the orchestrator coordinating learning between tiers.

pith-pipeline@v0.9.1-grok · 5808 in / 1391 out tokens · 23407 ms · 2026-07-02T20:12:54.411536+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 2 canonical work pages · 1 internal anchor

[1]

URL https://arxiv.org/abs/2303.11366. Sumers, T. R., Yao, S., Narasimhan, K., and Griffiths, T. L. Cognitive architectures for language agents. Transactions on Machine Learning Research,
[2]

Sutton, R

URL https://openreview.net/forum?id=1i6ZCvflQJ. Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999. URL https://doi.org/10.1016/S0004-3702(99)00052-1. Talebirad, Y., Parsaee, A., Szepesvari, C. Y., Nadiri, A., and Zaiane, O. Towa...

work page arXiv 1999
[3]

URL https://arxiv.org/abs/2305.02499. Zhao, A., Huang, D., Xu, Q., Lin, M., Liu, Y.-J., and Huang, G. ExpeL: LLM agents are experiential learners. In AAAI, 2024. URL https://doi.org/10.1609/aaai.v38i17.29936. Zhu, X., Chen, Y., Tian, H., et al. Ghost in the Minecraft: Generally capable agents for open-world environments via large language models with t...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

UNDERFITTING --- train score near baseline, small gap
[5]

OVERFITTING --- large train-val gap, high train score
[6]

FEATURE GAP --- top features dominate, score plateaus
[7]

NOISE CEILING --- high CV variance, score fluctuates
[8]

DISTRIBUTION MISMATCH --- train-val-test disagreement
[9]

Propose ONE atomic change to improve the score

DIMINISHING RETURNS --- each iter improves <0.1% Mention which mode applies and which past skill (if any) influenced your decision. Propose ONE atomic change to improve the score. Return JSON: {"plan": ..., "change specification": ..., "decision": "CONTINUE" | "NEXT TIER" | "STOP"} Figure 4. Abbreviated refinement proposal prompt. The tier label...
[10]

"skip" --- already covered or too obvious
[11]

"competition" --- too specific (dataset quirks, row indices)
[12]

Abstract it

"domain" --- generalizable to similar tasks. Abstract it
[13]

"global" --- universally useful across all ML tasks
[14]

Note conditions under which each holds

"conflict" --- contradicts an existing learning. Note conditions under which each holds. ## Quality Standards - Be selective. Promote AT MOST 50% of learnings. - Abstractions MUST NOT mention specific competition names, dataset names, or exact score values. Bad: "On aerial-cactus, AUC reached 0.9997" Good: "When AUC is near ceiling (>0.999), further ...
