COMFYCLAW: Self-Evolving Skill Harnesses for Image Generation Workflows

Dawei Liu; Fuxiao Liu; Jingxi Chen; Jing Xie; Lichao Sun; Xiaomin Wu; Xiyang Wu; Yuhang Zhou; Zongxia Li

arxiv: 2607.01709 · v1 · pith:IDVLFCKZnew · submitted 2026-07-02 · 💻 cs.AI · cs.LG

Zongxia Li, Dawei Liu, Fuxiao Liu, Yuhang Zhou, Xiyang Wu, Jingxi Chen, Jing Xie, Xiaomin Wu, Lichao Sun

Pith reviewed 2026-07-03 14:20 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords agentic workflows · skill evolution · ComfyUI · image generation · VLM verifier · graph editing · reusable skills · workflow construction

The pith

COMFYCLAW shows agents improve image workflow construction by evolving a skill library from past trajectories and verifier feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces COMFYCLAW to handle repeated image generation tasks by turning workflow building into typed graph editing with stage-specific tools and automatic reversion of bad edits. A region-level vision-language model verifier converts visual problems into concrete repair steps, while past runs are distilled into an expanding library of reusable skills. Across four benchmark splits, three agent models, and two image backbones, the full system with skill evolution scores highest on average and beats a verifier-only version that lacks the evolving library. Human raters also favor the outputs from the skill-evolving version over the non-evolving one.

Core claim

COMFYCLAW formulates workflow construction as typed graph editing, exposes tools by construction stage, automatically reverts invalid edits, and uses a region-level VLM verifier to turn visual failures into repair suggestions; it then evolves a progressively disclosed skill library by distilling trajectories, execution errors, and verifier feedback into reusable Agent Skills, achieving the best average image-generation evaluation score across all six agent configurations and outperforming the verifier-only baseline without skill evolution.
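To make "typed graph editing with automatic reversion" concrete, here is a minimal sketch of the idea, not the paper's implementation. The `Node`, `Graph`, and `apply_edit` names, the socket type strings, and the node labels are all assumptions made for illustration; real ComfyUI graphs carry far more structure.

```python
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str       # illustrative label, not a real ComfyUI class name
    inputs: dict    # input name -> required socket type
    output: str     # socket type this node produces


@dataclass
class Graph:
    nodes: dict = field(default_factory=dict)   # node id -> Node
    edges: dict = field(default_factory=dict)   # (dst id, input name) -> src id

    def validate(self):
        """Return a list of type errors; an empty list means the graph is valid."""
        errors = []
        for (dst, inp), src in self.edges.items():
            if src not in self.nodes or dst not in self.nodes:
                errors.append(f"dangling edge {src} -> {dst}.{inp}")
                continue
            want = self.nodes[dst].inputs.get(inp)
            have = self.nodes[src].output
            if want != have:
                errors.append(f"{src} -> {dst}.{inp}: got {have}, need {want}")
        return errors


def apply_edit(graph, edit):
    """Try an edit on a copy; keep it only if the result validates."""
    candidate = deepcopy(graph)
    edit(candidate)
    errors = candidate.validate()
    if errors:
        return graph, errors      # revert: the caller keeps the old graph
    return candidate, []


graph = Graph()
graph.nodes["loader"] = Node("loader", {}, "MODEL")
graph.nodes["sampler"] = Node("sampler", {"model": "MODEL"}, "LATENT")
graph.nodes["decoder"] = Node("decoder", {"samples": "LATENT"}, "IMAGE")


def connect_model(g):
    g.edges[("sampler", "model")] = "loader"


def connect_wrong_type(g):
    g.edges[("decoder", "samples")] = "loader"   # MODEL into a LATENT input


graph, ok_errors = apply_edit(graph, connect_model)
reverted, bad_errors = apply_edit(graph, connect_wrong_type)
```

The second edit fails validation, so `apply_edit` hands back the untouched graph together with the error list, which an agent could read as feedback for its next attempt. Editing a copy keeps the revert trivial at the cost of one deep copy per edit.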

What carries the argument

The progressively disclosed skill library distilled from trajectories, execution errors, and verifier feedback into reusable Agent Skills.
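A small sketch may help picture what "progressively disclosed" could mean in practice: the agent first sees only skill names and one-line summaries, and loads a full skill body on request. The `Skill` and `SkillLibrary` classes below are hypothetical, and the `distill` method is a plain-text placeholder for whatever summarization step the paper actually uses (which the abstract does not specify). The example strings are made up.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str   # short summary, always visible to the agent
    body: str          # full instructions, disclosed only on request
    uses: int = 0


class SkillLibrary:
    def __init__(self):
        self._skills = {}

    def add(self, skill):
        self._skills[skill.name] = skill

    def index(self):
        """Cheap listing for the agent's context: names and summaries only."""
        return [(s.name, s.description) for s in self._skills.values()]

    def read(self, name):
        """Disclose one skill's full body and count the read."""
        skill = self._skills[name]
        skill.uses += 1
        return skill.body

    def distill(self, name, description, trajectory, errors, feedback):
        """Placeholder distillation: fold one run's records into a skill body."""
        lines = ["Steps that worked:"] + [f"- {step}" for step in trajectory]
        if errors:
            lines += ["Errors to avoid:"] + [f"- {e}" for e in errors]
        if feedback:
            lines += ["Verifier feedback:"] + [f"- {f}" for f in feedback]
        self.add(Skill(name, description, "\n".join(lines)))


library = SkillLibrary()
library.distill(
    name="count-objects",
    description="Fix wrong object counts",
    trajectory=["restate the count in the prompt"],
    errors=["sampler input left unconnected"],
    feedback=["only three lions visible"],
)
listing = library.index()               # what the agent sees up front
body = library.read("count-objects")    # loaded only when the agent asks
```

Keeping a `uses` counter per skill is one simple way to measure how often a distilled skill is actually read in later runs.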

If this is right

The skill-evolving version produces higher average image generation scores than the fixed-verifier baseline.
Human annotators rate COMFYCLAW outputs higher than those from the version without skill evolution.
The gains hold across three different agent models and two image backbones.
Skill evolution raises reliability for agents on recurring visual workflow tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distillation process could be tested on non-image workflows such as code or data pipelines to check transfer.
As the skill library grows, agents might need the verifier less often on familiar tasks.
Measuring how often a distilled skill is reused on new but related workflows would test whether the library actually generalizes.

Load-bearing premise

Improvements in scores come from distilling past runs into reusable skills rather than from other unmeasured differences in the agent setups or prompts.

What would settle it

Re-running the four benchmark splits with skill evolution turned off but every other component identical and finding no drop in average scores.

Figures

Figures reproduced from arXiv: 2607.01709 by Dawei Liu, Fuxiao Liu, Jingxi Chen, Jing Xie, Lichao Sun, Xiaomin Wu, Xiyang Wu, Yuhang Zhou, Zongxia Li.

**Figure 1.** Overall framework of COMFYCLAW. The agent edits a ComfyUI workflow graph, the runtime renders a candidate image, the verifier returns requirement-level and region-level feedback, and the agent evolves skills that can be reused in future workflow-construction runs.

**Figure 2.** Evolved-skill usage and workflow-edit behavior of COMFYCLAW on Claude-Sonnet4.5 (aggregated over LongCat-Image and Z-Image-Turbo). Left: agents read evolved skills heavily on dense / compositional benchmarks (DPG, GenEval2) and predefined skills more on the OneIG splits. Right: only ∼39% of edits are prompt-text rewrites; the agent spends the rest on hyperparameters, graph topology, model/LoRA choices, an…

**Figure 3.** Qualitative comparison across methods on six prompts spanning five capability categories. Each column is a prompt (header shows category and full description); rows are Base (single-pass baseline) and Ours (COMFYCLAW, green border). COMFYCLAW more reliably realises object counts, spatial relations, scene-text accuracy, and fine-grained attribute control.

**Figure 4.** Iterative visual refinement under COMFYCLAW. (a) four purple lions (verifier 0.33→0.91); (b) a clock to the left of three glass pigs (verifier 0.31→0.96). Each strip reads left to right: the agent inspects the current image, the verifier emits a critique (blue), the agent turns it into a refinement instruction (orange) for the next pass, and re-executes the workflow; the selected best output is highlighted…
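The loop that the Figure 4 caption describes (render, critique, revise, re-execute, keep the best output) can be sketched as follows. This is an illustration under stated assumptions: `refine` and its callback parameters are hypothetical names, and the toy `render`, `verify`, and `revise` functions stand in for ComfyUI execution, the VLM verifier, and the agent. Their scores are synthetic and do not reproduce the numbers in the figure.

```python
def refine(workflow, render, verify, revise, max_rounds=5, target=0.9):
    """Render, score, and revise until the verifier score reaches target."""
    best_image, best_score = None, float("-inf")
    for _ in range(max_rounds):
        image = render(workflow)
        score, critique = verify(image)
        if score > best_score:                 # remember the best pass so far
            best_image, best_score = image, score
        if score >= target:
            break
        workflow = revise(workflow, critique)  # critique becomes the next edit
    return best_image, best_score


# Toy stand-ins so the loop runs without a model or a ComfyUI server.
def render(workflow):
    return tuple(workflow["hints"])            # the "image" is just the hints used


def verify(image):
    score = min(1.0, len(image) / 2)           # more applied hints, higher score
    return score, f"hint {len(image) + 1}: restate the object count"


def revise(workflow, critique):
    return {**workflow, "hints": workflow["hints"] + [critique]}


start = {"prompt": "four purple lions", "hints": []}
best_image, best_score = refine(start, render, verify, revise)
```

Tracking the best-scoring image separately from the latest one matters because a revision can make things worse; the caption's "selected best output" suggests the system does something similar.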

Original abstract

Agents are increasingly used to construct workflows and assist humans in completing recurring tasks more efficiently. As these workflows become repeated and domain-specific, agent memory and reusable skills become increasingly important: agents should be able to recall workflow patterns, execution constraints, and user preferences from previous runs. We study this problem in workflow-based image generation and introduce COMFYCLAW, an agentic skill evolution harness for controlling ComfyUI workflows. COMFYCLAW formulates workflow construction as typed graph editing, exposes tools organized by construction stage, automatically reverts invalid edits, and uses a region-level vision-language model (VLM) verifier to translate visual failures into actionable repair suggestions. The framework further evolves a progressively disclosed skill library, where trajectories, execution errors, and verifier feedback from previous runs are distilled into reusable Agent Skills. Across four benchmark splits, three agent models, and two image backbones, COMFYCLAW achieves the best average image-generation evaluation score across all six agent configurations, outperforming a verifier-only baseline without skill evolution. Human annotations further show that annotators prefer COMFYCLAW over variants without skill evolution. Our results suggest that skill evolution is an effective mechanism for improving agent reliability and performance in recurring visual workflow construction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

COMFYCLAW layers skill evolution onto ComfyUI agents but the gains are not isolated from the other new components.

The main takeaway is that COMFYCLAW puts together typed graph editing, automatic reversion of bad edits, a region-level VLM verifier, and a growing library of distilled Agent Skills for ComfyUI image workflows. It reports the best average scores across four benchmark splits, three agent models, and two backbones, plus human preference over the version without skill evolution.

The concrete framework is the useful part. It gives a clear way to handle recurring visual tasks by turning past trajectories, errors, and verifier feedback into reusable skills that get disclosed progressively. That matches a practical need in agent-driven creative tooling where the same patterns repeat.

The soft spot is the missing isolation of the skill-evolution piece. The verifier-only baseline appears to lack the graph editing and reversion tools as well, so the performance lift could come from those mechanisms rather than the distilled skills being retrieved and used. The abstract also gives no metric definitions, statistical tests, error bars, or ablation results, which leaves the central claim without enough support to judge from the text.

This is for readers working on agent reliability inside specific tools like ComfyUI rather than on general AI theory. A serious referee could check whether the full paper supplies the controls and details that are absent here; the problem is well scoped and the approach is described in enough detail to evaluate.

Referee Report

2 major / 0 minor

Summary. The paper introduces COMFYCLAW, a framework for agent-driven ComfyUI image-generation workflows that formulates construction as typed graph editing, provides stage-organized tools with automatic invalid-edit reversion, employs a region-level VLM verifier for repair suggestions, and evolves a progressively disclosed skill library by distilling trajectories, errors, and feedback. It reports that COMFYCLAW obtains the highest average image-generation evaluation score across four benchmark splits, three agent models, two backbones, and all six agent configurations, outperforming a verifier-only baseline, with additional human preference for the skill-evolution variant.

Significance. If the performance claims can be substantiated with defined metrics, statistical controls, and isolating ablations, the work would provide concrete evidence that distilling execution history into reusable Agent Skills improves reliability in recurring visual workflow tasks, offering a practical mechanism for agent memory in domain-specific automation.

major comments (2)

[Abstract and Experiments] Abstract and Experiments section: the central claim of superior average scores across all six configurations supplies no definition of the image-generation evaluation score, no statistical tests, no error bars, and no exclusion criteria or sample sizes, leaving the reported outperformance unverifiable and the soundness of the empirical result at risk.
[Framework and Experiments] Framework and Experiments sections: the comparison to the verifier-only baseline does not include ablations that isolate the contribution of the progressively disclosed skill library from the other introduced mechanisms (typed graph editing, automatic reversion, region-level VLM verifier), so the causality of skill-evolution gains cannot be established from the reported results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the clarity of our empirical claims and the need for stronger isolation of contributions. We address each major comment below and will revise the manuscript accordingly to improve verifiability and causal attribution.

Referee: [Abstract and Experiments] Abstract and Experiments section: the central claim of superior average scores across all six configurations supplies no definition of the image-generation evaluation score, no statistical tests, no error bars, and no exclusion criteria or sample sizes, leaving the reported outperformance unverifiable and the soundness of the empirical result at risk.

Authors: We agree that the image-generation evaluation score must be explicitly defined and that statistical details are required to substantiate the reported outperformance. In the revised manuscript we will add a precise definition of the evaluation score (including its components and aggregation method) in the Experiments section. We will also report sample sizes, any exclusion criteria, error bars (standard deviations across runs), and appropriate statistical tests (such as paired t-tests) comparing COMFYCLAW against the baseline across the six configurations.

revision: yes

Referee: [Framework and Experiments] Framework and Experiments sections: the comparison to the verifier-only baseline does not include ablations that isolate the contribution of the progressively disclosed skill library from the other introduced mechanisms (typed graph editing, automatic reversion, region-level VLM verifier), so the causality of skill-evolution gains cannot be established from the reported results.

Authors: We acknowledge that the current verifier-only baseline removes skill evolution while retaining the other mechanisms, and therefore does not fully isolate the skill library's contribution. To establish causality more rigorously, we will add targeted ablation experiments in the revised manuscript that disable the progressively disclosed skill library (while keeping typed graph editing, automatic reversion, and the region-level VLM verifier active) and report the resulting performance differences across the benchmarks.

revision: yes
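For readers unfamiliar with the paired t-test mentioned in the first response, the following is a minimal standard-library sketch of the statistic, assuming one score pair per agent configuration. The `paired_t` helper is a hypothetical name, and the numbers are placeholders chosen for easy arithmetic, not results from the paper.

```python
import math
import statistics


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)               # sample standard deviation
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Placeholder numbers only, NOT results from the paper: one pair per configuration.
with_skills = [6, 7, 8, 6, 7, 8]
without_skills = [5, 5, 5, 5, 5, 5]
t_stat, dof = paired_t(with_skills, without_skills)
```

The statistic would then be compared against a t distribution with `dof` degrees of freedom. With only six pairs the test has limited power, so reporting per-configuration differences alongside it would likely be more informative than a single p-value.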

Circularity Check

0 steps flagged

No circularity: empirical benchmark claims with no derivations or self-referential reductions

The paper contains no equations, derivations, or mathematical claims. Its central result is an empirical performance comparison across agent configurations and baselines on image-generation benchmarks. The skill-evolution mechanism is described as one component among several (graph editing, reversion, VLM verifier), but the evaluation reports direct scores rather than any fitted parameter or self-defined quantity that reduces to its own inputs by construction. No self-citation chains or uniqueness theorems are invoked to justify the result. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central performance claim rests on the effectiveness of the skill-evolution mechanism, which depends on untested domain assumptions about graph editing and VLM feedback translation; no free parameters or invented physical entities are described.

axioms (2)

domain assumption: Workflow construction can be formulated as typed graph editing with stage-organized tools and automatic reversion of invalid edits.
This premise structures the entire tool-use and editing process described in the abstract.
domain assumption: A region-level vision-language model can translate visual failures into actionable repair suggestions.
This is required for the verifier component that feeds into skill distillation.

invented entities (1)

progressively disclosed skill library of Agent Skills (no independent evidence)
purpose: To store and retrieve distilled trajectories, errors, and verifier feedback for reuse in future workflow construction.
This is the core new mechanism whose effectiveness is claimed to drive the reported performance gains.

Reference graph

Works this paper leans on

75 extracted references · 38 canonical work pages · 21 internal anchors

[1]

Controllable generation with text-to-image diffusion models: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

Pu Cao, Feng Zhou, Qing Song, and Lu Yang. Controllable generation with text-to-image diffusion models: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

2025
[2]

Comfyui-r1: Exploring reasoning models for workflow generation.arXiv preprint arXiv:2506.09790, 2025

Zhenran Xu, Yiyu Wang, Xue Yang, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, and Min Zhang. Comfyui-r1: Exploring reasoning models for workflow generation.arXiv preprint arXiv:2506.09790, 2025

work page arXiv 2025
[3]

Launching ComfyUI registry

ComfyUI. Launching ComfyUI registry. https://blog.comfy.org/p/ launching-comfyui-registry, 2026. 10

2026
[4]

Comfygen: Prompt-adaptive workflows for text-to-image generation.arXiv preprint arXiv:2410.01731, 2024

Rinon Gal, Adi Haviv, Yuval Alaluf, Amit H Bermano, Daniel Cohen-Or, and Gal Chechik. Comfygen: Prompt-adaptive workflows for text-to-image generation.arXiv preprint arXiv:2410.01731, 2024

work page arXiv 2024
[5]

Comfygpt: A self-optimizing multi-agent system for comprehensive comfyui workflow generation.arXiv preprint arXiv:2503.17671, 2025

Oucheng Huang, Yuhang Ma, Zeng Zhao, Mingrui Wu, Jiayi Ji, Rongsheng Zhang, Zhipeng Hu, Xiaoshuai Sun, and Rongrong Ji. Comfygpt: A self-optimizing multi-agent system for comprehensive comfyui workflow generation.arXiv preprint arXiv:2503.17671, 2025

work page arXiv 2025
[6]

SkillDAG: Self-Evolving Typed Skill Graphs for LLM Skill Selection at Scale

Tong Bai, Zhenglin Wan, Pengfei Zhou, Xingrui Yu, Wangbo Zhao, Yang You, and Ivor W Tsang. Skilldag: Self-evolving typed skill graphs for llm skill selection at scale.arXiv preprint arXiv:2606.03056, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[7]

Graph-of-Skills: Dependency-Aware Structural Retrieval for Massive Agent Skills

Dawei Liu, Zongxia Li, Hongyang Du, Xiyang Wu, Shihang Gui, Yongbei Kuang, and Lichao Sun. Graph of skills: Dependency-aware structural retrieval for massive agent skills.arXiv preprint arXiv:2604.05333, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[8]

Voyager: An Open-Ended Embodied Agent with Large Language Models

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. V oyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Os-copilot: Towards generalist computer agents with self-improvement.arXiv preprint arXiv:2402.07456, 2024

Zhiyong Wu, Chengcheng Han, Zichen Ding, Zhenmin Weng, Zhoumianze Liu, Shunyu Yao, Tao Yu, and Lingpeng Kong. Os-copilot: Towards generalist computer agents with self-improvement.arXiv preprint arXiv:2402.07456, 2024

work page arXiv 2024
[10]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

2024
[11]

Laradji, Manuel Del Verme, Tom Marty, David Vazquez, Nicolas Chapados, and Alexandre Lacoste

Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, David Vazquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks? In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, ed...

2024
[12]

Xu, Siva Reddy, Gra- ham Neubig, Quentin Cappart, Russ Salakhutdinov, and Nicolas Chapados

Thibault Le Sellier de Chezelles, Maxime Gasse, Alexandre Lacoste, Massimo Caccia, Alexan- dre Drouin, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Gra- ham Neubig, Quentin Cappart, Russ Salakhutdinov, and Nicolas Chapados. The browsergym ecosys...

2025
[13]

Agents play thousands of 3d video games.arXiv preprint arXiv:2503.13356, 2025

Zhongwen Xu, Xianliang Wang, Siyi Li, Tao Yu, Liang Wang, Qiang Fu, and Wei Yang. Agents play thousands of 3d video games.arXiv preprint arXiv:2503.13356, 2025

work page arXiv 2025
[14]

Natural-Language Agent Harnesses

Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, and Hai-Tao Zheng. Natural-language agent harnesses.arXiv preprint arXiv:2603.25723, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

Comfysearch: Autonomous exploration and reasoning for comfyui workflows.arXiv preprint arXiv:2601.04060, 2026

Jinwei Su, Qizhen Lan, Zeyu Wang, Yinghui Xia, Hairu Wen, Yiqun Duan, Xi Xiao, Tianyu Shi, Yang Jingsong, and Lewei He. Comfysearch: Autonomous exploration and reasoning for comfyui workflows.arXiv preprint arXiv:2601.04060, 2026

work page arXiv 2026
[16]

Genagent: Scaling text-to-image generation via agentic multimodal reasoning.arXiv preprint arXiv:2601.18543, 2026

Kaixun Jiang, Yuzheng Wang, Junjie Zhou, Pandeng Li, Zhihang Liu, Chen-Wei Xie, Zhaoyu Chen, Yun Zheng, and Wenqiang Zhang. Genagent: Scaling text-to-image generation via agentic multimodal reasoning.arXiv preprint arXiv:2601.18543, 2026

work page arXiv 2026
[17]

Comfybench: Benchmarking llm-based agents in comfyui for autonomously designing collaborative ai systems

Xiangyuan Xue, Zeyu Lu, Di Huang, Zidong Wang, Wanli Ouyang, and Lei Bai. Comfybench: Benchmarking llm-based agents in comfyui for autonomously designing collaborative ai systems. InProceedings of the computer vision and pattern recognition conference, pages 24614–24624, 2025. 11

2025
[18]

Gems: Agent-native multimodal generation with memory and skills.arXiv preprint arXiv:2603.28088, 2026

Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Yu Cheng, and Yang Yang. Gems: Agent-native multimodal generation with memory and skills.arXiv preprint arXiv:2603.28088, 2026

work page arXiv 2026
[19]

Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

2023
[20]

Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

2023
[21]

Effective harnesses for long-running agents

Justin Young. Effective harnesses for long-running agents. Anthropic, November 2025

2025
[22]

Autoharness: improving llm agents by automatically synthesizing a code harness.arXiv preprint arXiv:2603.03329, 2026

Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, and Kevin P Murphy. Autoharness: improving llm agents by automatically synthesizing a code harness.arXiv preprint arXiv:2603.03329, 2026

work page arXiv 2026
[23]

Meta-Harness: End-to-End Optimization of Model Harnesses

Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn. Meta-harness: End-to-end optimization of model harnesses.arXiv preprint arXiv:2603.28052, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[24]

Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward

Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward.arXiv preprint arXiv:2602.12430, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, et al. Evolver: Self-evolving llm agents through an experience-driven lifecycle.arXiv preprint arXiv:2510.16079, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

SkillRL: Evolving Agents via Recursive Skill-Augmented Reinforcement Learning

Peng Xia, Jianwen Chen, Hanyang Wang, Jiaqi Liu, Kaide Zeng, Yu Wang, Siwei Han, Yiyang Zhou, Xujiang Zhao, Haifeng Chen, et al. Skillrl: Evolving agents via recursive skill-augmented reinforcement learning.arXiv preprint arXiv:2602.08234, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

Trace2Skill: Distill Trajectory-Local Lessons into Transferable Agent Skills

Jingwei Ni, Yihao Liu, Xinpeng Liu, Yutao Sun, Mengyu Zhou, Pengyu Cheng, Dexin Wang, Xiaoxi Jiang, and Guanjun Jiang. Trace2skill: Distill trajectory-local lessons into transferable agent skills.arXiv preprint arXiv:2603.25158, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[28]

arXiv preprint arXiv:2511.10395 , year=

Yunpeng Zhai, Shuchang Tao, Cheng Chen, Anni Zou, Ziqian Chen, Qingxu Fu, Shinji Mai, Li Yu, Jiaji Deng, Zouying Cao, et al. Agentevolver: Towards efficient self-evolving agent system.arXiv preprint arXiv:2511.10395, 2025

work page arXiv 2025
[29]

A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

Huan-ang Gao, Jiayi Geng, Wenyue Hua, Mengkang Hu, Xinzhe Juan, Hongzhang Liu, Shilong Liu, Jiahao Qiu, Xuan Qi, Yiran Wu, et al. A survey of self-evolving agents: On path to artificial super intelligence.arXiv preprint arXiv:2507.21046, 1, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[30]

A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems

Jinyuan Fang, Yanwen Peng, Xi Zhang, Yingxu Wang, Xinhao Yi, Guibin Zhang, Yi Xu, Bin Wu, Siwei Liu, Zihao Li, et al. A comprehensive survey of self-evolving ai agents: A new paradigm bridging foundation models and lifelong agentic systems.arXiv preprint arXiv:2508.07407, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Comfygi: Automatic improvement of image generation workflows.arXiv preprint arXiv:2411.14193, 2024

Dominik Sobania, Martin Briesch, and Franz Rothlauf. Comfygi: Automatic improvement of image generation workflows.arXiv preprint arXiv:2411.14193, 2024

work page arXiv 2024
[32]

Blender.https://www.blender.org, 2026

Blender Foundation. Blender.https://www.blender.org, 2026. Accessed: 2026-04-29

2026
[33]

Houdini.https://www.sidefx.com, 2026

SideFX. Houdini.https://www.sidefx.com, 2026. Accessed: 2026-04-29

2026
[34]

Foundry. Nuke. https://www.foundry.com/products/nuke-family/nuke, 2026. Ac- cessed: 2026-04-29

2026
[35]

Unreal engine

Epic Games. Unreal engine. https://www.unrealengine.com, 2026. Accessed: 2026-04- 29. 12

2026
[36]

ComfyUI: The most powerful and modular diffusion model gui, api and backend with a graph/nodes interface, 2023

Comfyanonymous. ComfyUI: The most powerful and modular diffusion model gui, api and backend with a graph/nodes interface, 2023. URL https://github.com/comfyanonymous/ ComfyUI. Accessed: 2026-04-29

2023
[37]

Comfymind: Toward general-purpose generation via tree-based planning and reactive feedback.arXiv preprint arXiv:2505.17908, 2025

Litao Guo, Xinli Xu, Luozhou Wang, Jiantao Lin, Jinsong Zhou, Zixin Zhang, Bolan Su, and Ying-Cong Chen. Comfymind: Toward general-purpose generation via tree-based planning and reactive feedback.arXiv preprint arXiv:2505.17908, 2025

work page arXiv 2025
[38]

Spellburst: A node-based interface for exploratory creative coding with natural language prompts

Tyler Angert, Miroslav Suzara, Jenny Han, Christopher Pondoc, and Hariharan Subramonyam. Spellburst: A node-based interface for exploratory creative coding with natural language prompts. InProceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pages 1–22, 2023

2023
[39]

A visual programming paradigm for abstract deep learning model development

Srikanth G Tamilselvam, Naveen Panwar, Shreya Khare, Rahul Aralikatte, Anush Sankaran, and Senthil Mani. A visual programming paradigm for abstract deep learning model development. InProceedings of the 10th Indian Conference on Human-Computer Interaction, pages 1–11, 2019

2019
[40]

From cloud to edge: Rethinking generative ai for low-resource design challenges.arXiv preprint arXiv:2402.12702, 2024

Sai Krishna Revanth Vuruma, Ashley Margetts, Jianhai Su, Faez Ahmed, and Biplav Srivastava. From cloud to edge: Rethinking generative ai for low-resource design challenges.arXiv preprint arXiv:2402.12702, 2024

work page arXiv 2024
[41]

Eureka: Human-Level Reward Design via Coding Large Language Models

Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models.arXiv preprint arXiv:2310.12931, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[42]

Dspy: compiling declarative language model calls into state-of-the-art pipelines

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Saiful Haq, Ashutosh Sharma, Thomas T Joshi, Hanna Moazam, Heather Miller, et al. Dspy: compiling declarative language model calls into state-of-the-art pipelines. InThe Twelfth International Conference on Learning Representations, 2023

2023
[43]

Agent skills specification

Agent Skills. Agent skills specification. https://agentskills.io/specification, 2026

2026
[44]

Claude code.https://code.claude.com/docs/en/overview, 2025

Anthropic. Claude code.https://code.claude.com/docs/en/overview, 2025

2025
[45]

Hermes agent: The self-improving ai agent, 2026

Nous Research. Hermes agent: The self-improving ai agent, 2026. URL https://github. com/NousResearch/hermes-agent. Accessed: 2026-04-29

2026
[46]

OpenClaw-personal ai assistant, 2026

OpenClaw Contributors. OpenClaw-personal ai assistant, 2026. URL https://github.com/ openclaw/openclaw. Accessed: 2026-04-29

2026
[47]

EvoSkill: Automated Skill Discovery for Multi-Agent Systems

Salaheddin Alzubi, Noah Provenzano, Jaydon Bingham, Weiyuan Chen, and Tu Vu. Evoskill: Automated skill discovery for multi-agent systems.arXiv preprint arXiv:2603.02766, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[48]

Co-Evolving LLM Decision and Skill Bank Agents for Long-Horizon Tasks

Xiyang Wu, Zongxia Li, Guangyao Shi, Alexander Duffy, Tyler Marques, Matthew Lyle Olson, Tianyi Zhou, and Dinesh Manocha. Co-evolving llm decision and skill bank agents for long-horizon tasks, 2026. URLhttps://arxiv.org/abs/2604.20987

work page internal anchor Pith review Pith/arXiv arXiv 2026
[49]

DeerFlow: Deep exploration and efficient research flow, 2025

ByteDance. DeerFlow: Deep exploration and efficient research flow, 2025. URL https: //github.com/bytedance/deer-flow. Accessed: 2026-04-29

2025
[50]

XSkill: Continual Learning from Experience and Skills in Multimodal Agents

Guanyu Jiang, Zhaochen Su, Xiaoye Qu, et al. Xskill: Continual learning from experience and skills in multimodal agents.arXiv preprint arXiv:2603.12056, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[51]

Odysseus: Scaling vlms to 100+ turn decision-making in games via reinforcement learning,

Chengshuai Shi, Wenzhe Li, Xinran Liang, Yizhou Lu, Wenjia Yang, Ruirong Feng, Seth Karten, Ziran Yang, Zihan Ding, Gabriel Sarch, Danqi Chen, Karthik Narasimhan, and Chi Jin. Odysseus: Scaling vlms to 100+ turn decision-making in games via reinforcement learning,
[52]

URLhttps://arxiv.org/abs/2605.00347

work page internal anchor Pith review Pith/arXiv arXiv
[53]

Agent laboratory: Using llm agents as research assistants.Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025

Samuel Schmidgall, Yusheng Su, Ze Wang, Ximeng Sun, Jialian Wu, Xiaodong Yu, Jiang Liu, Michael Moor, Zicheng Liu, and Emad Barsoum. Agent laboratory: Using llm agents as research assistants.Findings of the Association for Computational Linguistics: EMNLP 2025, pages 5977–6043, 2025. 13

2025
[54]

GenEval 2: Addressing benchmark drift in text-to-image evaluation

Amita Kamath, Kai-Wei Chang, Ranjay Krishna, Luke Zettlemoyer, Yushi Hu, and Marjan Ghazvininejad. GenEval 2: Addressing benchmark drift in text-to-image evaluation. arXiv preprint arXiv:2512.16853, 2025

[55]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. ELLA: Equip diffusion models with LLM for enhanced semantic alignment. arXiv preprint arXiv:2403.05135, 2024

[56]

OneIG-Bench: Omni-dimensional nuanced evaluation for image generation

Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. OneIG-Bench: Omni-dimensional nuanced evaluation for image generation. arXiv preprint arXiv:2506.07977, 2025

[57]

Claude

Anthropic. Claude, 2025. URL https://www.anthropic.com/claude. Accessed: 2026-04-30

[58]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[59]

Gemma 4

Gemma Team, Google DeepMind. Gemma 4, 2026. URL https://ai.google.dev/gemma/docs/core/model_card_4

[60]

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-Image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699, 2025

[61]

LongCat-Image Technical Report

Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, et al. LongCat-Image technical report. arXiv preprint arXiv:2512.07584, 2025

[62]

Tifa: Accurate and interpretable text-to-image faithfulness evaluation with question answering

Yushi Hu, Benlin Liu, Jungo Kasai, Yizhong Wang, Mari Ostendorf, Ranjay Krishna, and Noah A Smith. TIFA: Accurate and interpretable text-to-image faithfulness evaluation with question answering. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20406–20417, 2023

[63]

Evaluating text-to-visual generation with image-to-text generation

Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. In European Conference on Computer Vision, pages 366–384. Springer, 2024

[64]

Qwen3-VL Technical Report

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025

[65]

Realistic snapshot (z-image-turbo)

Civitai Community Model. Realistic snapshot (z-image-turbo). https://civitai.com/models/2268008/realistic-snapshot-z-image-turbo?modelVersionId=2617751, 2026

[66]

enhancer

Civitai Community Model. enhancer. https://civitai.com/models/2239743/enhancer?modelVersionId=2521349, 2026

A Predefined Tools, Skills, and LoRA Settings

Workflow tools. We expose 17 basic tools for controlling ComfyUI workflows. These tools allow the agent to inspect the current workflow, add and remove nodes, connect nodes, edit node inputs, set pro...

1. Call report_evolution_strategy first: state your plan and the top issue
2. Call inspect_workflow to see the current topology
3. **If the workflow is empty** (no nodes):
   a. Call read_skill("workflow-builder") to load architecture recipes.
   b. Call query_available_models("checkpoints") and query_available_models("diffusion_models") to discover available models -- NEVER guess filenames.
   c. Match the model filename to an architecture (SD 1.5, SDXL, Flux, Qwen, etc.) using the patterns ...
4. **If the workflow already has nodes**, follow the evolution strategy:
   a. Call set_prompt -- craft a detailed, professional positive prompt AND a strong negative prompt based on the user’s goal (see "Prompt engineering" below). Do this EVERY iteration, even if you also plan structural changes.
   b. If a relevant skill is listed in <available_skills>, call re...

1. Call `inspect_workflow` to see the FULL current topology and all connections
2. Call `validate_workflow` to get a list of graph errors (dangling refs, wrong slots)
3. For each error:
   - If a node references a nonexistent source -> fix with `connect_nodes` or `delete_node`
   - If a slot index is wrong -> `delete_node` the broken node and `add_node` a new one with correct wiring
   - If a model/filename is wrong -> use `query_available_models` to get exact names, then `set_param`
   - If a node class doesn’t exist -> `delete_node...
4. Call `validate_workflow` again to confirm all issues are resolved
5. Call `finalize_workflow` (it will auto-validate and block if still broken).

**IMPORTANT:** Do NOT just add new nodes on top of broken ones -- `delete_node` the broken node first, then `add_node` a replacement with correct connections.

**Output slot reference:**
CheckpointLoaderSimple -> slot 0: MODEL, slot 1: CLIP, slot 2: VAE
UNETLoader / CLIPLoader / VA...
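
The two graph errors named in the repair steps (dangling references and wrong slot indices) can be illustrated with a small checker. This is only a sketch, not the paper's `validate_workflow` tool: the function names `is_link` and `find_graph_errors` and the sample workflow are hypothetical. It assumes ComfyUI's API-format JSON, where each node is keyed by a string id and a link is written as `[source_node_id, slot_index]`. The only slot layout included is the `CheckpointLoaderSimple` one quoted above.

```python
from typing import Any

# Output slot layout per node class. Only CheckpointLoaderSimple is taken
# from the slot reference in the text; other classes are left unchecked.
OUTPUT_SLOTS: dict[str, list[str]] = {
    "CheckpointLoaderSimple": ["MODEL", "CLIP", "VAE"],
}


def is_link(value: Any) -> bool:
    """Return True if value looks like a [source_node_id, slot_index] link."""
    return (
        isinstance(value, list)
        and len(value) == 2
        and isinstance(value[0], str)
        and isinstance(value[1], int)
    )


def find_graph_errors(workflow: dict[str, dict[str, Any]]) -> list[str]:
    """Hypothetical checker: report dangling refs and out-of-range slots."""
    errors: list[str] = []
    for node_id, node in workflow.items():
        for name, value in node.get("inputs", {}).items():
            if not is_link(value):
                continue  # literal parameter, nothing to check
            src_id, slot = value
            src = workflow.get(src_id)
            if src is None:
                errors.append(f"{node_id}.{name}: dangling reference to node {src_id}")
                continue
            slots = OUTPUT_SLOTS.get(src["class_type"])
            if slots is not None and not 0 <= slot < len(slots):
                errors.append(
                    f"{node_id}.{name}: slot {slot} out of range for {src['class_type']}"
                )
    return errors


# Illustrative workflow with one dangling reference and one wrong slot.
workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "model.safetensors"}},
    "2": {"class_type": "CLIPTextEncode",
          "inputs": {"text": "a red fox", "clip": ["1", 1]}},
    "3": {"class_type": "VAEDecode",
          "inputs": {"samples": ["9", 0], "vae": ["1", 3]}},
}
errors = find_graph_errors(workflow)
for line in errors:
    print(line)
```

Node "3" triggers both checks: its `samples` input points at a node that does not exist, and its `vae` input asks for slot 3 of a loader that only exposes slots 0 to 2. A repair loop in the spirit of the steps above would delete and re-add the broken node, then run the checker again before finalizing.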

