
arxiv: 2607.01709 · v1 · pith:IDVLFCKZnew · submitted 2026-07-02 · 💻 cs.AI · cs.LG

COMFYCLAW: Self-Evolving Skill Harnesses for Image Generation Workflows

Pith reviewed 2026-07-03 14:20 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords agentic workflows · skill evolution · ComfyUI · image generation · VLM verifier · graph editing · reusable skills · workflow construction

The pith

COMFYCLAW shows agents improve image workflow construction by evolving a skill library from past trajectories and verifier feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces COMFYCLAW to handle repeated image generation tasks by turning workflow building into typed graph editing with stage-specific tools and automatic reversion of bad edits. A region-level vision-language model verifier converts visual problems into concrete repair steps, while past runs are distilled into an expanding library of reusable skills. Across four benchmark splits, three agent models, and two image backbones, the full system with skill evolution scores highest on average and beats a verifier-only version that lacks the evolving library. Human raters also favor the outputs from the skill-evolving version over the non-evolving one.

Core claim

COMFYCLAW formulates workflow construction as typed graph editing, exposes tools by construction stage, automatically reverts invalid edits, and uses a region-level VLM verifier to turn visual failures into repair suggestions; it then evolves a progressively disclosed skill library by distilling trajectories, execution errors, and verifier feedback into reusable Agent Skills, achieving the best average image-generation evaluation score across all six agent configurations and outperforming the verifier-only baseline without skill evolution.
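
The typed-editing and automatic-reversion idea in the claim above can be made concrete with a minimal Python sketch. Everything in it is an assumption made for illustration: the node catalogue `NODE_SPECS`, the `Graph` class, and the helpers `validate`, `apply_edit`, `add_node`, and `connect` are hypothetical names, not ComfyUI's API and not the paper's implementation.

```python
import copy
from dataclasses import dataclass, field

# Hypothetical node catalogue: input socket types and ordered output types.
# Illustrative only; these are not real ComfyUI node classes.
NODE_SPECS = {
    "Loader": {"inputs": {}, "outputs": ["MODEL"]},
    "Sampler": {"inputs": {"model": "MODEL"}, "outputs": ["LATENT"]},
    "Decoder": {"inputs": {"latent": "LATENT"}, "outputs": ["IMAGE"]},
}


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)  # node id -> node type
    edges: dict = field(default_factory=dict)  # (dst id, input name) -> (src id, output slot)


def validate(g):
    """Return a list of type errors; an empty list means the graph is valid."""
    errors = [f"unknown node type {t!r}" for t in g.nodes.values() if t not in NODE_SPECS]
    if errors:
        return errors
    for (dst, inp), (src, slot) in g.edges.items():
        if src not in g.nodes or dst not in g.nodes:
            errors.append(f"dangling edge {src} -> {dst}.{inp}")
            continue
        outputs = NODE_SPECS[g.nodes[src]]["outputs"]
        wanted = NODE_SPECS[g.nodes[dst]]["inputs"].get(inp)
        if wanted is None:
            errors.append(f"{dst} has no input named {inp!r}")
        elif not 0 <= slot < len(outputs) or outputs[slot] != wanted:
            errors.append(f"{src}[{slot}] does not provide {wanted} for {dst}.{inp}")
    return errors


def apply_edit(g, edit):
    """Apply an edit in place; restore the previous graph if validation fails."""
    snapshot = copy.deepcopy(g)
    edit(g)
    errors = validate(g)
    if errors:
        g.nodes, g.edges = snapshot.nodes, snapshot.edges  # automatic reversion
    return errors


def add_node(node_id, node_type):
    def edit(g):
        g.nodes[node_id] = node_type
    return edit


def connect(src, slot, dst, inp):
    def edit(g):
        g.edges[(dst, inp)] = (src, slot)
    return edit
```

In this sketch `apply_edit` returns the error list instead of raising, so a caller can read why an edit was rejected while the graph stays in its last valid state. Wiring a `MODEL` output into a `LATENT` input, for example, is reported and undone.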

What carries the argument

The progressively disclosed skill library distilled from trajectories, execution errors, and verifier feedback into reusable Agent Skills.
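
One way to picture "progressively disclosed" is a library that always shows the agent a short index and loads a skill's full body only on request. The sketch below assumes that reading; the `Skill` and `SkillLibrary` classes and the `distill` helper are hypothetical, and the distillation step is a deliberately trivial stand-in (deduplicating failure notes) for whatever the paper actually does.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    description: str  # short summary, always visible to the agent
    body: str         # full instructions, loaded only when requested


class SkillLibrary:
    def __init__(self):
        self._skills = {}

    def add(self, skill):
        # A later distillation with the same name replaces the earlier one.
        self._skills[skill.name] = skill

    def index(self):
        """Cheap listing: names and one-line descriptions only."""
        ordered = sorted(self._skills.values(), key=lambda s: s.name)
        return "\n".join(f"- {s.name}: {s.description}" for s in ordered)

    def read(self, name):
        """Full disclosure of one skill body, on demand."""
        return self._skills[name].body


def distill(name, failure_notes):
    """Placeholder distillation: keep each distinct failure note once, in order."""
    unique = list(dict.fromkeys(failure_notes))
    body = "\n".join(f"{i}. Avoid: {note}" for i, note in enumerate(unique, 1))
    return Skill(name, f"{len(unique)} lesson(s) from past runs", body)
```

The design choice this illustrates is that the index stays small as the library grows, so the cost of carrying many skills is paid only when a specific skill is read.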

If this is right

  • The skill-evolving version produces higher average image generation scores than the fixed-verifier baseline.
  • Human annotators rate COMFYCLAW outputs higher than those from the version without skill evolution.
  • The gains hold across three different agent models and two image backbones.
  • Skill evolution raises reliability for agents on recurring visual workflow tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation process could be tested on non-image workflows such as code or data pipelines to check transfer.
  • As the skill library grows, agents might need the verifier less often on familiar tasks.
  • Measuring how often a distilled skill is reused on new but related workflows would test whether the library actually generalizes.

Load-bearing premise

Improvements in scores come from distilling past runs into reusable skills rather than from other unmeasured differences in the agent setups or prompts.

What would settle it

Re-running the four benchmark splits with skill evolution turned off but every other component identical and finding no drop in average scores.

Figures

Figures reproduced from arXiv: 2607.01709 by Dawei Liu, Fuxiao Liu, Jingxi Chen, Jing Xie, Lichao Sun, Xiaomin Wu, Xiyang Wu, Yuhang Zhou, Zongxia Li.

Figure 1: Overall framework of COMFYCLAW. The agent edits a ComfyUI workflow graph, the runtime renders a candidate image, the verifier returns requirement-level and region-level feedback, and the agent evolves skills that can be reused in future workflow-construction runs.
Figure 2: Evolved-skill usage and workflow-edit behavior of COMFYCLAW on Claude-Sonnet-4.5 (aggregated over LongCat-Image and Z-Image-Turbo). Left: agents read evolved skills heavily on dense / compositional benchmarks (DPG, GenEval2) and predefined skills more on the OneIG splits. Right: only ∼39% of edits are prompt-text rewrites; the agent spends the rest on hyperparameters, graph topology, model/LoRA choices, an…
Figure 3: Qualitative comparison across methods on six prompts spanning five capability categories. Each column is a prompt (header shows category and full description); rows are Base (single-pass baseline) and Ours (COMFYCLAW, green border). COMFYCLAW more reliably realises object counts, spatial relations, scene-text accuracy, and fine-grained attribute control.
Figure 4: Iterative visual refinement under COMFYCLAW. (a) four purple lions (verifier 0.33→0.91); (b) a clock to the left of three glass pigs (verifier 0.31→0.96). Each strip reads left to right: the agent inspects the current image, the verifier emits a critique (blue), the agent turns it into a refinement instruction (orange) for the next pass, and re-executes the workflow; the selected best output is highlighted…
Original abstract

Agents are increasingly used to construct workflows and assist humans in completing recurring tasks more efficiently. As these workflows become repeated and domain-specific, agent memory and reusable skills become increasingly important: agents should be able to recall workflow patterns, execution constraints, and user preferences from previous runs. We study this problem in workflow-based image generation and introduce COMFYCLAW, an agentic skill evolution harness for controlling ComfyUI workflows. COMFYCLAW formulates workflow construction as typed graph editing, exposes tools organized by construction stage, automatically reverts invalid edits, and uses a region-level vision-language model (VLM) verifier to translate visual failures into actionable repair suggestions. The framework further evolves a progressively disclosed skill library, where trajectories, execution errors, and verifier feedback from previous runs are distilled into reusable Agent Skills. Across four benchmark splits, three agent models, and two image backbones, COMFYCLAW achieves the best average image-generation evaluation score across all six agent configurations, outperforming a verifier-only baseline without skill evolution. Human annotations further show that annotators prefer COMFYCLAW over variants without skill evolution. Our results suggest that skill evolution is an effective mechanism for improving agent reliability and performance in recurring visual workflow construction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces COMFYCLAW, a framework for agent-driven ComfyUI image-generation workflows that formulates construction as typed graph editing, provides stage-organized tools with automatic invalid-edit reversion, employs a region-level VLM verifier for repair suggestions, and evolves a progressively disclosed skill library by distilling trajectories, errors, and feedback. It reports that COMFYCLAW obtains the highest average image-generation evaluation score across four benchmark splits, three agent models, two backbones, and all six agent configurations, outperforming a verifier-only baseline, with additional human preference for the skill-evolution variant.

Significance. If the performance claims can be substantiated with defined metrics, statistical controls, and isolating ablations, the work would provide concrete evidence that distilling execution history into reusable Agent Skills improves reliability in recurring visual workflow tasks, offering a practical mechanism for agent memory in domain-specific automation.

major comments (2)
  1. [Abstract and Experiments] Abstract and Experiments section: the central claim of superior average scores across all six configurations supplies no definition of the image-generation evaluation score, no statistical tests, no error bars, and no exclusion criteria or sample sizes, leaving the reported outperformance unverifiable and the soundness of the empirical result at risk.
  2. [Framework and Experiments] Framework and Experiments sections: the comparison to the verifier-only baseline does not include ablations that isolate the contribution of the progressively disclosed skill library from the other introduced mechanisms (typed graph editing, automatic reversion, region-level VLM verifier), so the causality of skill-evolution gains cannot be established from the reported results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the clarity of our empirical claims and the need for stronger isolation of contributions. We address each major comment below and will revise the manuscript accordingly to improve verifiability and causal attribution.

point-by-point responses
  1. Referee: [Abstract and Experiments] Abstract and Experiments section: the central claim of superior average scores across all six configurations supplies no definition of the image-generation evaluation score, no statistical tests, no error bars, and no exclusion criteria or sample sizes, leaving the reported outperformance unverifiable and the soundness of the empirical result at risk.

    Authors: We agree that the image-generation evaluation score must be explicitly defined and that statistical details are required to substantiate the reported outperformance. In the revised manuscript we will add a precise definition of the evaluation score (including its components and aggregation method) in the Experiments section. We will also report sample sizes, any exclusion criteria, error bars (standard deviations across runs), and appropriate statistical tests (such as paired t-tests) comparing COMFYCLAW against the baseline across the six configurations. revision: yes

  2. Referee: [Framework and Experiments] Framework and Experiments sections: the comparison to the verifier-only baseline does not include ablations that isolate the contribution of the progressively disclosed skill library from the other introduced mechanisms (typed graph editing, automatic reversion, region-level VLM verifier), so the causality of skill-evolution gains cannot be established from the reported results.

    Authors: We acknowledge that the current verifier-only baseline removes skill evolution while retaining the other mechanisms, and therefore does not fully isolate the skill library's contribution. To establish causality more rigorously, we will add targeted ablation experiments in the revised manuscript that disable the progressively disclosed skill library (while keeping typed graph editing, automatic reversion, and the region-level VLM verifier active) and report the resulting performance differences across the benchmarks. revision: yes
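
The paired comparison promised in the first response could be sketched as follows. This is a standard-library-only illustration of a paired t statistic, assuming one score per configuration for each system (six pairs in the setting described here). The function name `paired_t` is hypothetical, no numbers from the paper are used, and turning the statistic into a p-value would need a t distribution from a statistics package.

```python
import math
import statistics


def paired_t(full, baseline):
    """Return (t statistic, degrees of freedom) for paired score lists.

    full[i] and baseline[i] must come from the same configuration.
    """
    if len(full) != len(baseline) or len(full) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [f - b for f, b in zip(full, baseline)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    n = len(diffs)
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

Pairing by configuration matters because agent model and image backbone shift scores for both systems at once; the test looks only at the per-configuration differences.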

Circularity Check

0 steps flagged

No circularity: empirical benchmark claims with no derivations or self-referential reductions

full rationale

The paper contains no equations, derivations, or mathematical claims. Its central result is an empirical performance comparison across agent configurations and baselines on image-generation benchmarks. The skill-evolution mechanism is described as one component among several (graph editing, reversion, VLM verifier), but the evaluation reports direct scores rather than any fitted parameter or self-defined quantity that reduces to its own inputs by construction. No self-citation chains or uniqueness theorems are invoked to justify the result. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central performance claim rests on the effectiveness of the skill-evolution mechanism, which depends on untested domain assumptions about graph editing and VLM feedback translation; no free parameters or invented physical entities are described.

axioms (2)
  • [domain assumption] Workflow construction can be formulated as typed graph editing with stage-organized tools and automatic reversion of invalid edits.
    This premise structures the entire tool-use and editing process described in the abstract.
  • [domain assumption] A region-level vision-language model can translate visual failures into actionable repair suggestions.
    This is required for the verifier component that feeds into skill distillation.
invented entities (1)
  • progressively disclosed skill library of Agent Skills (no independent evidence)
    purpose: To store and retrieve distilled trajectories, errors, and verifier feedback for reuse in future workflow construction.
    This is the core new mechanism whose effectiveness is claimed to drive the reported performance gains.

pith-pipeline@v0.9.1-grok · 5776 in / 1388 out tokens · 35572 ms · 2026-07-03T14:20:24.901168+00:00 · methodology

