Why Search When You Can Transfer? Amortized Agentic Workflow Design from Structural Priors

Carl Kingsford; Jiayi Li; Jiayuan Liu; Shiyi Du; Vincent Conitzer; Weihua Du; Xiangliang Zhang; Yingtao Luo; Yue Huang

arxiv: 2604.25012 · v1 · submitted 2026-04-27 · 💻 cs.LG

Pith reviewed 2026-05-08 04:07 UTC · model grok-4.3

classification 💻 cs.LG

keywords agentic workflows · workflow design · amortized optimization · structural priors · few-shot transfer · LLM agents · search-based methods · benchmark evaluation

The pith

By distilling structural priors from past optimizations, SWIFT synthesizes agent workflows for new tasks in a single LLM pass that outperforms per-task search at far lower cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current approaches to agent workflow design run expensive iterative search anew for each task, without reusing knowledge gained from earlier work. The paper observes that successful workflows within a domain converge on a small set of similar topologies. SWIFT extracts reusable priors consisting of compositional heuristics and interface contracts through contrastive analysis of search trajectories on source tasks. For a fresh target task the method prompts a language model once using those priors plus a handful of cross-task workflow examples to produce a complete executable workflow. This amortizes the design process so that new tasks incur negligible extra cost while still delivering stronger results than search on the evaluated benchmarks.

Core claim

We observe that optimized workflows converge to a small family of domain-specific topologies, suggesting that this combinatorial search is largely redundant. Building on this insight, we propose SWIFT, a framework that amortizes workflow design into reusable structural priors. SWIFT first distills compositional heuristics and output-interface contracts from contrastive analysis of prior search trajectories across source tasks. At inference time, it conditions a single LLM generation pass on these priors together with cross-task workflow demonstrations to synthesize a complete, executable workflow for an unseen target task, bypassing iterative search entirely.
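
To make the single-pass idea concrete, here is a minimal sketch of how such a synthesis prompt could be assembled. Every name in it (`Demo`, `StructuralPriors`, `build_prompt`, `synthesize_workflow`, the prompt layout, and the `llm` callable) is a hypothetical illustration, not the paper's implementation. Only the ingredients come from the paper's description: operator definitions, compositional heuristics, output contracts, and demonstrations drawn from tasks other than the target.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Demo:
    task: str           # source task the workflow was optimized for
    workflow_code: str  # optimized workflow, kept as source text


@dataclass(frozen=True)
class StructuralPriors:
    heuristics: List[str]  # compositional heuristics (H)
    contracts: List[str]   # output-interface contracts (C)


def build_prompt(target_task: str, operators: Dict[str, str],
                 priors: StructuralPriors, demos: List[Demo]) -> str:
    """Assemble one prompt; demos for the target task itself are left out."""
    usable = [d for d in demos if d.task != target_task]  # leave-one-out
    parts = ["## Operators"]
    parts += [f"- {name}: {desc}" for name, desc in operators.items()]
    parts += ["## Heuristics"] + [f"- {h}" for h in priors.heuristics]
    parts += ["## Output contracts"] + [f"- {c}" for c in priors.contracts]
    for d in usable:
        parts += [f"## Demonstration ({d.task})", d.workflow_code]
    parts += [f"## Target task: {target_task}", "Write the complete workflow."]
    return "\n".join(parts)


def synthesize_workflow(llm: Callable[[str], str], prompt: str) -> str:
    """Exactly one generation call: no search loop, no evaluation feedback."""
    return llm(prompt)
```

The design point the sketch tries to capture is that all task-specific cost sits in one call to `llm`; the priors and demonstrations are computed once offline and reused.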

What carries the argument

SWIFT framework that distills compositional heuristics and output-interface contracts from contrastive analysis of prior search trajectories to condition a single LLM generation pass using structural priors and workflow demonstrations.

If this is right

SWIFT outperforms the state-of-the-art search-based method on five benchmarks
Reduces marginal per-task optimization cost by three orders of magnitude
Generalizes to four additional unseen benchmarks
Transfers successfully across foundation models
Retains over 93 percent of average performance when all operator names are replaced with random strings
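
The operator-renaming ablation in the last item can be pictured with a short sketch. It assumes demonstrations are stored as source text and that operator names appear as whole identifiers; the function name `anonymize_operators`, the `op_` alias scheme, and the toy demonstration are hypothetical and not taken from the paper.

```python
import random
import re
import string
from typing import Dict, List, Tuple


def anonymize_operators(code: str, operator_names: List[str],
                        seed: int = 0) -> Tuple[str, Dict[str, str]]:
    """Replace each operator name with a random identifier, consistently."""
    rng = random.Random(seed)
    mapping: Dict[str, str] = {}
    used = set()
    for name in operator_names:
        alias = "op_" + "".join(rng.choices(string.ascii_lowercase, k=8))
        while alias in used:  # extremely unlikely, but keep aliases unique
            alias = "op_" + "".join(rng.choices(string.ascii_lowercase, k=8))
        used.add(alias)
        mapping[name] = alias
    if not mapping:
        return code, mapping
    # Whole-word match so that substrings of longer identifiers are untouched.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in mapping) + r")\b"
    )
    return pattern.sub(lambda m: mapping[m.group(1)], code), mapping


# Toy demonstration (illustrative only): two parallel paths, then an ensemble.
demo = "a = Generate(p)\nb = Generate(p)\nreturn ScEnsemble([a, b])"
anonymized, mapping = anonymize_operators(demo, ["Generate", "ScEnsemble"])
```

After renaming, the call graph (two parallel calls feeding one aggregation step) is unchanged while the names carry no meaning, which is the property the ablation relies on.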

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Maintaining libraries of domain-specific workflow topologies could support rapid synthesis for entire application areas
The dominance of topology over surface semantics points to future emphasis on abstract graph structures in workflow transfer
Amortization via structural priors may reduce search costs in adjacent problems such as automated tool composition or multi-step planning
Agent systems could accumulate and refine structural knowledge across successive deployments rather than restarting optimization each time

Load-bearing premise

Optimized workflows in a domain converge to a small family of similar topologies so that priors from a few source tasks suffice for new ones.
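
One way to put a number on "a small family" is to count distinct topology signatures across optimized workflows. The sketch below is an assumption-laden illustration, not a metric reported in the paper: it treats a workflow as a set of directed edges between operator types, and the names `signature` and `diversity` are hypothetical.

```python
from collections import Counter
from math import log2
from typing import Dict, FrozenSet, Iterable, List, Tuple

Edge = Tuple[str, str]  # (upstream operator type, downstream operator type)


def signature(edges: Iterable[Edge]) -> FrozenSet[Edge]:
    """Order-independent topology signature over operator types."""
    return frozenset(edges)


def diversity(workflows: List[List[Edge]]) -> Dict[str, float]:
    """Summarize how concentrated a set of workflows is on few topologies."""
    if not workflows:
        raise ValueError("need at least one workflow")
    counts = Counter(signature(w) for w in workflows)
    total = sum(counts.values())
    entropy = -sum((c / total) * log2(c / total) for c in counts.values())
    return {
        "unique_topologies": len(counts),
        "top_share": max(counts.values()) / total,  # mass on the modal topology
        "entropy_bits": entropy,                    # 0.0 means a single topology
    }
```

A low count of unique signatures and low entropy would support the premise; a near-uniform spread over many signatures would undercut it. Collapsing a workflow to operator-type edges ignores repetition counts and branching order, so this is a coarse first check rather than a replacement for graph-edit distance.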

What would settle it

Applying SWIFT to a new collection of tasks whose optimal workflows exhibit high structural diversity and checking whether its results fall below those of per-task iterative search.

Figures

Figures reproduced from arXiv: 2604.25012 by Carl Kingsford, Jiayi Li, Jiayuan Liu, Shiyi Du, Vincent Conitzer, Weihua Du, Xiangliang Zhang, Yingtao Luo, Yue Huang.

**Figure 1.** Iterative search (left) vs. SWIFT (right).

**Figure 2.** Overview of SWIFT. Offline: contrastive trajectory distillation extracts compositional heuristics (H) and output contracts (C) from source-task search histories. Online: a single LLM call synthesizes an executable workflow for an unseen task, conditioned on operator definitions, distilled priors, and leave-one-out demonstrations. The workflow is directly executed on test instances.

**Figure 3.** Effect of demo count (k) on performance. GSM8K saturates at k=1; MATH jumps at k=4, suggesting complex tasks need more demos to disambiguate workflow topology. The empirical success of SWIFT invites a reevaluation of how the field currently approaches autonomous agent design. By transitioning from test-time combinatorial search to amortized synthesis guided by demonstrations, our findings highlight se…

**Figure 4.** OOD performance overview. Each SWIFT bar decomposes the full test set into …

**Figure 5.** AQuA failure type distribution (34 failures). Computation errors dominate (38%), …

**Figure 6.** AIME per-problem results. Each cell represents one problem; green = correct, …

**Figure 7.** BigCodeBench failure type breakdown (542 failures). The dominant failure …

**Figure 8.** Problem length distribution (words) for all 11 benchmarks. Filled bars show …

**Figure 9.** Topic distribution in the MATH test set (486 Level-5 problems).

**Figure 10.** Synthesized workflow topologies with vs. without output contracts. With …

Original abstract

Automated agentic workflow design currently relies on per-task iterative search, which is computationally prohibitive and fails to reuse structural knowledge across tasks. We observe that optimized workflows converge to a small family of domain-specific topologies, suggesting that this combinatorial search is largely redundant. Building on this insight, we propose SWIFT (Synthesizing Workflows via Few-shot Transfer), a framework that amortizes workflow design into reusable structural priors. SWIFT first distills compositional heuristics and output-interface contracts from contrastive analysis of prior search trajectories across source tasks. At inference time, it conditions a single LLM generation pass on these priors together with cross-task workflow demonstrations to synthesize a complete, executable workflow for an unseen target task, bypassing iterative search entirely. On five benchmarks, SWIFT outperforms the state-of-the-art search-based method while reducing marginal per-task optimization cost by three orders of magnitude. It further generalizes to four additional unseen benchmarks and transfers successfully from GPT-4o-mini to three additional foundation models (Grok, Qwen, Gemma). Controlled ablations reveal that workflow demonstrations primarily transfer topological structure rather than surface semantics: replacing all operator names with random strings still retains over 93% of the full system's average performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SWIFT amortizes agent workflow design by distilling structural priors from past searches for single-pass synthesis, with clear efficiency gains but resting on an unquantified claim about topology convergence.

The core idea is that instead of running expensive per-task search to build agent workflows, you can extract reusable structural patterns from earlier successful searches and feed them to an LLM for one-shot generation on new tasks. SWIFT does this by contrasting trajectories across source tasks to pull out compositional heuristics and output-interface contracts, then conditions a single generation pass on those priors plus a few cross-task examples. That is the actual novelty here, and it is a direct response to the compute problem in current search-based methods for agentic design. The paper reports that this beats the prior SOTA search approach on five benchmarks while cutting marginal per-task cost by three orders of magnitude, generalizes to four held-out benchmarks, and transfers from GPT-4o-mini to Grok, Qwen, and Gemma. The ablation that replaces operator names with random strings yet keeps over 93 percent of performance is a clean way to show that topology, not surface labels, is what carries over. Those pieces are concrete and worth noting.

The soft spot is the starting premise that optimized workflows converge to a small family of domain-specific topologies. The abstract presents this as an observation that makes the amortization possible, but there are no supporting numbers such as graph-edit distances, topology clustering, or diversity metrics across the source tasks. Without that, it is not obvious how narrow the family really is or how reliably the distilled priors will hold up on harder unseen cases. The performance edge could shrink if the topologies turn out more varied than assumed. Experimental details on exact baselines, variance, and statistical tests are also missing from what is visible, so the strength of the outperformance claims is still moderate.

This paper is for people working on scalable LLM agents and automated planning who need lower inference-time costs. A reader focused on practical deployment of complex agents would find the method and the ablation useful to think about. It has enough of a distinct approach and empirical story to deserve a serious referee rather than a desk reject, even though the topology claim needs tightening in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SWIFT, a framework that amortizes agentic workflow design by distilling structural priors (compositional heuristics and output-interface contracts) from contrastive analysis of search trajectories on source tasks. It rests on the observation that optimized workflows converge to a small family of domain-specific topologies, enabling single-pass LLM generation for unseen target tasks without iterative search. The authors report that SWIFT outperforms the state-of-the-art search-based method on five benchmarks while reducing marginal per-task optimization cost by three orders of magnitude, generalizes to four additional unseen benchmarks, and transfers successfully from GPT-4o-mini to Grok, Qwen, and Gemma. A controlled ablation shows that replacing operator names with random strings retains over 93% of average performance, suggesting topological structure drives the transfer.

Significance. If the convergence assumption holds and the results are robust, the work could meaningfully advance agentic AI by replacing expensive per-task combinatorial search with reusable priors, enabling scalable deployment. The reported cross-benchmark generalization, cross-model transfer, and the ablation isolating topological transfer (rather than surface semantics) are notable strengths that provide concrete evidence for the approach's value.

major comments (2)

[Abstract and §1] The central claim that 'optimized workflows converge to a small family of domain-specific topologies' is presented as an empirical observation enabling the amortized method, yet no quantitative support is provided such as graph-edit-distance distributions, topology clustering results, or diversity metrics across the five source benchmarks. This assumption is load-bearing for arguing that per-task search is largely redundant and for the claimed three-order-of-magnitude cost reduction plus cross-benchmark generalization.
[Experiments section (results tables and §4/§5)] The outperformance and generalization claims are reported without details on the number of independent runs, statistical significance tests, or variance measures, despite the stochastic nature of LLM-based workflow generation and optimization. This makes it difficult to assess whether the gains over the search-based baseline are reliable and reproducible.

minor comments (2)

[Method section] The terms 'compositional heuristics' and 'output-interface contracts' are used throughout but would benefit from explicit formal definitions or illustrative examples in the method section to improve clarity for readers.
[Figures and tables] Figure captions and table legends should explicitly reference the specific benchmarks and models used in each panel to aid quick interpretation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will make the indicated revisions to strengthen the empirical grounding and statistical reporting in the manuscript.

Referee: [Abstract and §1] The central claim that 'optimized workflows converge to a small family of domain-specific topologies' is presented as an empirical observation enabling the amortized method, yet no quantitative support is provided such as graph-edit-distance distributions, topology clustering results, or diversity metrics across the five source benchmarks. This assumption is load-bearing for arguing that per-task search is largely redundant and for the claimed three-order-of-magnitude cost reduction plus cross-benchmark generalization.

Authors: We agree that explicit quantitative metrics would provide stronger support for the convergence observation. While the performance gains, cross-benchmark generalization, and ablation isolating topological transfer (retaining >93% performance with randomized operator names) offer indirect evidence, we will add direct metrics in the revision. Specifically, we will report graph-edit-distance distributions across optimized workflows from the source tasks, results from topology clustering (e.g., hierarchical clustering on workflow graphs represented as DAGs), and diversity metrics such as the number of unique topologies and their frequency per domain. These additions will quantify the 'small family' claim and better substantiate the redundancy of per-task search. revision: yes
Referee: [Experiments section (results tables and §4/§5)] The outperformance and generalization claims are reported without details on the number of independent runs, statistical significance tests, or variance measures, despite the stochastic nature of LLM-based workflow generation and optimization. This makes it difficult to assess whether the gains over the search-based baseline are reliable and reproducible.

Authors: We acknowledge that the current reporting lacks explicit details on run counts, variance, and significance testing. In the revised Experiments section (§4/§5) and tables, we will specify the number of independent runs (e.g., 5 runs per configuration to account for LLM stochasticity), include mean performance with standard deviations or error bars, and report statistical significance (e.g., paired t-tests or Wilcoxon tests) for comparisons against the search-based baseline. This will improve reproducibility assessment without altering the core results. revision: yes
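
The kind of paired comparison promised in the second response can be sketched with the standard library alone. The rebuttal names paired t-tests or Wilcoxon tests; the sketch below swaps in an exact paired sign-flip permutation test instead, because it needs no external statistics package. The function name and the accuracy values are hypothetical placeholders, not results from the paper.

```python
from itertools import product
from statistics import mean
from typing import Sequence


def paired_sign_flip_pvalue(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact two-sided p-value for mean(a - b) under random sign flips."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equally sized, non-empty samples")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    # Under the null, each paired difference is equally likely to be +d or -d.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return hits / total


# Made-up accuracies for five paired runs (illustration only).
method_runs = [0.90, 0.91, 0.92, 0.93, 0.94]
baseline_runs = [0.80, 0.82, 0.81, 0.83, 0.84]
p_value = paired_sign_flip_pvalue(method_runs, baseline_runs)
```

With five paired runs there are only 2^5 = 32 sign patterns, so the smallest two-sided p-value this test can return is 2/32 = 0.0625. A design with five runs per configuration therefore cannot reach the conventional 0.05 threshold with this test, which is worth weighing when choosing the run count.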

Circularity Check

0 steps flagged

No circularity: empirical priors from external trajectories validated on held-out tasks

The paper's chain starts from an empirical observation of workflow convergence drawn from prior search trajectories on source tasks, proceeds to contrastive distillation of heuristics and contracts, and applies single-pass LLM generation on unseen targets using those priors plus demonstrations. Reported gains (outperformance on five benchmarks, a three-orders-of-magnitude reduction in marginal per-task optimization cost, generalization to four more benchmarks, and cross-model transfer) are measured against external search-based baselines and ablations (e.g., random operator names retaining over 93% of average performance). No equations, self-definitional reductions, fitted-input-as-prediction steps, or load-bearing self-citations appear that would make any result equivalent to its inputs by construction. The topology-convergence premise is presented as an input observation rather than a derived necessity, leaving the framework self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests primarily on the domain assumption that workflows share reusable topologies; the SWIFT framework itself is the main invented construct, with no explicit free parameters beyond choices in the LLM prompting process.

axioms (1)

domain assumption: Optimized workflows converge to a small family of domain-specific topologies
Stated directly in the abstract as the observation that makes per-task search redundant.

invented entities (1)

Structural priors (no independent evidence)
purpose: Distilled compositional heuristics and output-interface contracts used to condition LLM generation
New construct extracted from prior trajectories; no independent falsifiable handle outside the SWIFT pipeline is provided.

pith-pipeline@v0.9.0 · 5538 in / 1349 out tokens · 28530 ms · 2026-05-08T04:07:04.589491+00:00 · methodology

Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages

[1]

Extract ONLY the final numerical answer

,
    instruction=(
        "Extract ONLY the final numerical answer. "
        "No units, no explanation. "
    )
)

return final['response'], self.llm.cost_manager.total_cost

MATH. Same multi-path + ensemble structure as GSM8K, with a LaTeX \boxed{} extraction step tailored to the MATH answer format.

Listing 2: MATH workflow.

class Workflow ...

work page
[2]

Extract ONLY the final answer

,
    instruction=(
        "Extract ONLY the final answer "
        "in \\boxed{} format. "
    )
)

return (
    formatted_answer['response'],
    self.llm.cost_manager.total_cost
)

HumanEval. A test-driven retry loop: generate code, run unit tests, and if the tests fail, analyze the error and regenerate, up to two repair attempts.

Listing 3: HumanEval...

work page
[3]

Analyze why this code failed

,
    instruction=(
        "Analyze why this code failed "
        "and how to fix it"
    )
)

improved = await self.custom_code_generate(
    problem=problem,
    entry_point=entry_point,
    instruction=(
        f"Fix the code based on: "
        f"{analysis['response']}"
    )
)

test_result2 = await self.test(
    problem=problem, ...

work page
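
The HumanEval fragments above describe a generate, test, analyze, and regenerate loop with up to two repair attempts. Since the listing itself is truncated, the following is only a minimal synchronous sketch of that control flow; the original code is asynchronous, and `solve_with_repair` together with its three callables are hypothetical stand-ins for the operators it uses.

```python
from typing import Callable, Tuple


def solve_with_repair(generate: Callable[[str], str],
                      run_tests: Callable[[str], Tuple[bool, str]],
                      analyze: Callable[[str, str], str],
                      problem: str,
                      max_repairs: int = 2) -> Tuple[str, bool]:
    """Generate code, then repair it at most `max_repairs` times on failure."""
    code = generate(problem)
    passed, error = run_tests(code)
    for _ in range(max_repairs):
        if passed:
            break
        # Ask for a diagnosis of the failure, then regenerate conditioned on it.
        analysis = analyze(code, error)
        code = generate(f"{problem}\nFix the code based on: {analysis}")
        passed, error = run_tests(code)
    return code, passed
```

The loop returns the last candidate even when every attempt fails, so the caller can still score it; whether the paper's workflow does the same is not visible in the fragment.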
[4]

Given the problem and code output,

,
    instruction=(
        "Given the problem and code output, "
        "provide a detailed solution with LaTeX. "
        "Show step-by-step calculations. "
        "Present the final answer in "
        "\\boxed{} notation. "
    )
)

# Step 3: Generate 3 additional reasoning paths
solutions = [formatted_solution['response']]
for _ in range(3): ...

work page
[5]

Two correct implementations of the same function will almost never be textually identical

Surface-form diversity. Functionally equivalent programs can differ vastly in variable names, control flow structure, whitespace, and style. Two correct implementations of the same function will almost never be textually identical.

work page
[6]

disagreement,

No semantic equivalence check. Text-based voting treats each character difference as a “disagreement,” so even trivially equivalent programs (e.g., for i in range(n) vs. for i in range(0, n)) are counted as distinct solutions.

work page
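
The point about text-based voting can be reproduced in a few lines. The sketch below contrasts a vote over source strings with a vote over observed behaviour on a handful of inputs; `textual_vote`, `behavioural_vote`, and the three candidate programs are hypothetical illustrations, and grouping by outputs on sample inputs is an assumed stand-in for a semantic check, not something the paper is shown to implement. It calls `exec` on the candidates, which is only acceptable for trusted or sandboxed code.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence, Tuple


def textual_vote(candidates: List[str]) -> Tuple[str, int]:
    """Majority vote over raw source text."""
    return Counter(candidates).most_common(1)[0]


def behavioural_vote(candidates: List[str], entry_point: str,
                     inputs: Sequence[Any]) -> Tuple[str, int]:
    """Majority vote over outputs on sample inputs; crashing code is skipped."""
    votes: Counter = Counter()
    representative: Dict[Tuple[str, ...], str] = {}
    for source in candidates:
        namespace: Dict[str, Any] = {}
        try:
            exec(source, namespace)  # caution: run only in a sandbox
            outputs = tuple(repr(namespace[entry_point](x)) for x in inputs)
        except Exception:
            continue
        votes[outputs] += 1
        representative.setdefault(outputs, source)
    if not votes:
        raise ValueError("no candidate ran successfully")
    outputs, count = votes.most_common(1)[0]
    return representative[outputs], count


A = "def total(n):\n    s = 0\n    for i in range(n):\n        s += i\n    return s\n"
B = "def total(n):\n    s = 0\n    for i in range(0, n):\n        s += i\n    return s\n"
C = "def total(n):\n    return n * n\n"

text_winner, text_votes = textual_vote([A, B, C])
code_winner, code_votes = behavioural_vote([A, B, C], "total", [0, 1, 5])
```

Textually, all three candidates are distinct and no majority exists; behaviourally, `A` and `B` agree on every sample input and form a two-vote majority over `C`.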
[7]

compromise

Chimera outputs. When no clear majority exists, the ensemble prompt may produce a “compromise” snippet that combines fragments from multiple solutions, resulting in code that does not compile or execute. Observation. Across our experiments, we found that synthesized workflows which rely solely on ScEnsemble for code tasks consistently underperform those tha...

work page 2024
[8]

BigCodeBench (complex code tasks dissimilar from HumanEval/MBPP) shows the smallest (+2.7)

Transfer effectiveness scales with task similarity. SVAMP and AQuA (math benchmarks similar to GSM8K/MATH) show the largest gains (+14.8, +27.0). BigCodeBench (complex code tasks dissimilar from HumanEval/MBPP) shows the smallest (+2.7). The synthesized workflow’s strategies transfer best when the OOD task shares structural similarities with the demonstr...

work page
[9]

Ensemble voting has diminishing returns on hard problems. On SVAMP (easy), ensemble corrects many individual reasoning errors, yielding 95.8% accuracy. On …

Dataset | Fail Rate | Primary Mode | Why workflow helps / doesn’t help
SVAMP | 4.2% | Adversarial wording | Ensemble catches many errors; adversarial premises defeat all paths
AQuA | 16.7% | Computation erro...

work page
[10]

Nodes” counts total operator invocations; “Depth

Infrastructure failures mask model capability. BigCodeBench’s 59.4% failure rate suggests poor transfer, but 71% of failures are environmental (pyparsing, etc.). Adjusting for this, the true logic-error rate is ∼17%, comparable to MBPP’s 17.9%. The workflow’s code generation quality transfers well; the bottleneck is the execution sandbox, not the synthes...

work page
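
The adjustment quoted in the BigCodeBench fragment above can be checked directly. The sketch assumes that every non-environmental failure counts as a logic error and that both percentages refer to the same set of test problems; the function name is hypothetical.

```python
def adjusted_logic_error_rate(failure_rate: float,
                              environmental_share: float) -> float:
    """Share of all problems that fail for non-environmental reasons."""
    return failure_rate * (1.0 - environmental_share)


# Figures quoted in the fragment: 59.4% failures, 71% of them environmental.
rate = adjusted_logic_error_rate(0.594, 0.71)
print(f"{rate:.1%}")  # 17.2%
```

The result, 0.594 × 0.29 ≈ 0.172, matches the quoted figure of roughly 17%.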
