GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving

Ao Li; Fahao Chen; Peng Li; Shangpeng Yang; Tianheng Xu; Zhou Su

arxiv: 2605.22566 · v1 · pith:ORR4VPFKnew · submitted 2026-05-21 · 💻 cs.LG

Pith reviewed 2026-05-22 07:14 UTC · model grok-4.3

keywords: LLM agents · workflow management · graph-based workflows · adaptive workflow generation · KV cache optimization · memory-efficient serving · agent serving systems

The pith

Representing workflows as a unified graph of atomic operations lets LLM agents generate task-specific instructions dynamically and reuse computations for better efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing workflow systems for LLM agents depend on fixed templates and shallow matching that limit generalization to new tasks. The paper introduces wGraph as a single shared graph whose nodes are atomic operations, serving as a base from which specific workflows are built on demand according to task needs. GraphFlow adds adaptive generation from this graph and uses its structure to manage KV caches so that overlapping work across workflows is not recomputed. Tests on five benchmarks show higher success rates and much smaller memory use than prior methods. If the approach holds, agents could handle wider ranges of instructions without hand-crafted templates for each new scenario.

Core claim

The paper claims that a unified graph called wGraph, with each node as an atomic operation, provides a shared substrate from which task-specific workflows are dynamically instantiated based on semantics and constraints, and that exploiting the graph's structure for Key-Value cache management during serving reduces redundant computation, yielding better performance and lower memory use than template-based systems.

What carries the argument

wGraph, the unified graph in which nodes represent atomic operations and from which task-specific workflows are dynamically constructed while guiding KV cache reuse.
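
To make the idea of a shared substrate concrete, here is a minimal sketch of a graph of atomic operations from which a task-specific workflow is instantiated on demand. Everything in it is an assumption for illustration: the operation names echo the shopping example in Figure 1, the edges are invented, and a plain breadth-first search with a set of forbidden operations stands in for the paper's learned, semantics-aware generation procedure, which this page does not describe in detail.

```python
from collections import deque

# Hypothetical wGraph: each key is an atomic operation, each value lists the
# operations that may follow it. Names and edges are illustrative only.
WGRAPH = {
    "search": ["filter", "review"],
    "filter": ["review", "compare"],
    "review": ["compare", "purchase"],
    "compare": ["purchase"],
    "purchase": [],
}


def instantiate_workflow(graph, start, goal, forbidden=frozenset()):
    """Return the shortest operation path from start to goal that avoids
    every operation in `forbidden`, or None if no such path exists."""
    if start in forbidden or goal in forbidden:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen and nxt not in forbidden:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Two tasks share one graph but receive different workflows.
unconstrained = instantiate_workflow(WGRAPH, "search", "purchase")
no_reviews = instantiate_workflow(WGRAPH, "search", "purchase", {"review"})
```

The point of the sketch is only that no per-task template is stored: both workflows are derived from the same node and edge set, and a constraint changes which path is selected.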

If this is right

Task-specific workflows can be assembled from the shared graph without requiring a new template for every novel instruction.
KV cache management guided by graph connections avoids recomputing common operation sequences across different tasks.
The combined designs produce measurable gains in task completion rates while lowering the memory needed to run the agent.
Workflow integration happens inside the serving loop rather than as a separate preprocessing step.
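
The reuse claim in the list above can be illustrated with a toy cache keyed by operation prefixes. This is a deliberately simplified sketch, not the paper's mechanism: the class name `PrefixKVCache`, the `compute_state` placeholder, and the exact-prefix matching rule are all assumptions, and real KV states are tensors rather than strings.

```python
class PrefixKVCache:
    """Toy cache keyed by operation prefixes (tuples of operation names).

    `compute_state` stands in for the expensive model forward pass; here it
    only records that a prefix was computed.
    """

    def __init__(self):
        self._states = {}
        self.computed = 0  # number of prefixes actually computed

    def compute_state(self, prefix):
        self.computed += 1
        return "kv:" + ">".join(prefix)  # placeholder for real KV tensors

    def run(self, workflow):
        """Walk the workflow and return how many prefixes were reused."""
        reused = 0
        for end in range(1, len(workflow) + 1):
            prefix = tuple(workflow[:end])
            if prefix in self._states:
                reused += 1
            else:
                self._states[prefix] = self.compute_state(prefix)
        return reused


cache = PrefixKVCache()
first = cache.run(["search", "filter", "review"])     # nothing cached yet
second = cache.run(["search", "filter", "purchase"])  # shares a 2-step prefix
```

In this toy setting the second workflow recomputes only its final step, because the two workflows overlap on `search` and `search > filter`.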

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the set of atomic nodes over time could let the system support progressively more complex agent behaviors without redesigning the core infrastructure.
Similar graph substrates might be constructed for planning in non-LLM agents such as robotic controllers if their primitive actions can be enumerated.
The memory savings would likely increase in long multi-turn interactions where many workflow fragments overlap.

Load-bearing premise

A unified graph of atomic operations can capture deep semantic relationships and support generalization to unseen tasks more effectively than predefined templates and shallow matching.

What would settle it

Running GraphFlow on a collection of tasks whose required operations lie entirely outside the atomic nodes defined in the initial wGraph and checking whether accuracy falls to the level of template baselines would test the generalization claim.

Figures

Figures reproduced from arXiv: 2605.22566 by Ao Li, Fahao Chen, Peng Li, Shangpeng Yang, Tianheng Xu, Zhou Su.

**Figure 1.** Structured agentic workflow for complex online shopping. The agent executes a set of atomic operations (e.g., search, filter, review) to fulfill the user query.

**Figure 2.** The overall framework of our proposed GraphFlow.

**Figure 4.** Example of effective path pruning.

**Figure 5.** Memory-performance trade-off analysis. Comparison of memory footprint and task performance. Axes: batch size (concurrent requests) versus memory (GB). Series: stateful KV management, GraphFlow, stateless KV management.

**Figure 7.** Ablation study on path pruning. We compare GraphFlow with and without path pruning across five benchmarks. Across all tasks, enabling path pruning consistently reduces the KV memory footprint. For example, on GSM8K, memory usage decreases from 15.0 GB to 11.5 GB, and on MBPP from 9.9 GB to 7.2 GB. Similar reductions are observed on MATH and HotpotQA, where pru…

**Figure 8.** Ablation study on GCN depth. We evaluate the performance impact of varying GCN layers (from 1 to 4) across five benchmarks. The results demonstrate that the 2-layer architecture consistently yields optimal performance across different LLM backbones.

**Figure 9.** Additional visualizations of residual KV sparsity. We show the element-wise difference heatmaps for keys ($\Delta K$) and values ($\Delta V$) at representative layers and attention heads, comparing contextualized KV states against the context-free base cache.

Original abstract

Large Language Model (LLM)-based agents demonstrate strong reasoning and execution capabilities on complex tasks when guided by structured instructions, commonly referred to as workflows. However, existing workflow-assisted agent serving systems typically rely on predefined templates and shallow matching mechanisms, which limit their ability to capture deep semantic relationships and generalize to previously unseen tasks. To address these limitations, we propose a new workflow management paradigm that represents workflows using a unified graph, termed wGraph, where each node corresponds to an atomic operation. wGraph serves as a shared substrate from which task-specific workflows are dynamically instantiated. Building on wGraph primitives, we introduce GraphFlow, a system that efficiently integrates workflows into agent serving through two key designs. First, adaptive workflow generation dynamically constructs workflows from wGraph based on task semantics and constraint requirements. Second, workflow state management exploits wGraph structure to efficiently manage Key-Value (KV) caches, reducing redundant computation during agent serving. Extensive experiments across five benchmark datasets show that GraphFlow consistently outperforms state-of-the-art methods, yielding an average performance improvement of approximately 4.95 percentage points, while achieving an approximately 4$\times$ reduction in memory footprint.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GraphFlow uses a shared wGraph to dynamically build LLM agent workflows and manage KV caches, claiming modest gains over templates but with thin experimental detail so far.

The main thing to know is that this paper replaces fixed workflow templates with a graph substrate called wGraph. Nodes are atomic operations, and the system builds task-specific workflows on the fly while using the same graph to prune redundant KV cache entries during serving. That combination is the concrete novelty over prior shallow-matching approaches.

The design targets two practical pain points: generalization to unseen tasks and memory bloat in agent inference loops. If the graph really lets the system compose new sequences without hand-crafted templates, it could simplify deployment for people running multi-step agents at scale. The reported numbers, an average lift of 4.95 percentage points and roughly 4× lower memory across five benchmarks, are the sort of result that would matter for serving systems if they hold up. The KV-cache trick in particular feels like a direct engineering win rather than a theoretical claim.

On the soft side, the abstract gives almost no information on how adaptive generation actually selects or constrains the instantiated graph, what the baselines were, or whether the gains survive ablations on the graph structure itself. Without those controls, the improvements could just as easily come from better prompt engineering or implementation details as from the wGraph idea. The assumption that atomic nodes capture deep semantics for truly novel tasks also looks optimistic until we see failure cases or out-of-distribution results.

This is a systems paper for readers who already work on LLM agent serving or workflow orchestration. Anyone tuning inference stacks or building production agents could extract usable ideas from the cache management piece. It is coherent enough on its own terms to deserve a serious referee who can check the experimental setup and any released code.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes GraphFlow, a graph-based workflow management system for efficient LLM-agent serving. Workflows are represented via a unified graph (wGraph) whose nodes are atomic operations; task-specific workflows are dynamically instantiated from this shared substrate. The system introduces two main designs: adaptive workflow generation that constructs workflows from wGraph according to task semantics and constraints, and workflow state management that exploits wGraph structure for KV-cache reuse to reduce redundant computation. Experiments on five benchmark datasets report an average performance gain of approximately 4.95 percentage points and an approximately 4× reduction in memory footprint relative to state-of-the-art methods.

Significance. If the reported gains are reproducible under standard controls, the work offers a practical advance in LLM-agent serving by replacing rigid template-based workflows with a more flexible, graph-structured substrate that supports better generalization. The KV-cache exploitation mechanism is a concrete engineering contribution that directly targets memory efficiency in long-horizon agent execution.

major comments (1)

[Abstract and experimental evaluation] The central claim of consistent outperformance (4.95 pp average gain) and 4× memory reduction is presented without any description of baselines, statistical significance testing, variance across runs, or implementation details of the adaptive generation procedure. This information is load-bearing for assessing whether the gains are attributable to the proposed wGraph design rather than experimental artifacts.

minor comments (1)

The term 'wGraph' is introduced without an accompanying formal definition or illustrative diagram in the main text; a small example graph with node/edge semantics would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below and will revise the paper to improve clarity and rigor in the presentation of results.

Referee: [Abstract and experimental evaluation] The central claim of consistent outperformance (4.95 pp average gain) and 4× memory reduction is presented without any description of baselines, statistical significance testing, variance across runs, or implementation details of the adaptive generation procedure. This information is load-bearing for assessing whether the gains are attributable to the proposed wGraph design rather than experimental artifacts.

Authors: We agree that the abstract and experimental evaluation would benefit from greater specificity to allow readers to fully assess the source of the reported gains. In the revised manuscript we will (1) expand the abstract to name the concrete state-of-the-art baselines used for comparison, (2) add statistical significance testing (paired t-tests with p-values) and report standard deviations or confidence intervals across repeated runs in the main results tables, and (3) insert a concise but explicit description of the adaptive workflow generation procedure, including the core algorithm, semantic matching criteria, and key hyperparameters. These additions will be placed in both the abstract and the experimental section so that the attribution of improvements to the wGraph substrate is transparent. We do not believe any new experiments are required; the existing data already support the claims once the missing details are supplied.

Revision: yes
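
For readers unfamiliar with the paired comparison promised in point (2), the following is a minimal sketch of how a paired t statistic is computed from per-run scores. The scores are placeholders invented for illustration and are not results from the paper; the function name is hypothetical, and converting the statistic to a p-value (which needs the Student t distribution) is left out to keep the example within the standard library.

```python
import math
import statistics


def paired_t_statistic(baseline, treatment):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    if len(baseline) != len(treatment) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [t - b for b, t in zip(baseline, treatment)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


# Placeholder scores for illustration only; NOT results from the paper.
baseline_scores = [70.0, 72.0, 68.0, 71.0, 69.0]
treatment_scores = [74.0, 75.0, 73.0, 76.0, 72.0]
t_stat, dof = paired_t_statistic(baseline_scores, treatment_scores)
```

Pairing matters here because each run of the proposed system is compared with the baseline run on the same benchmark and seed, so between-benchmark variance cancels out of the differences.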

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper describes an engineering system (wGraph as shared substrate, adaptive workflow generation from task semantics, and KV-cache management exploiting graph structure) whose performance claims are presented as direct outcomes of experiments on five benchmark datasets. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided text. The central claims rest on empirical evaluation rather than reducing to inputs by construction, self-citation chains, or ansatz smuggling. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the domain assumption that workflows decompose cleanly into atomic operations representable in a single graph and that task semantics can be matched to graph substructures without additional learned parameters.

axioms (1)

Domain assumption: Workflows can be represented as graphs with atomic operations as nodes that serve as a shared substrate for dynamic instantiation.
This premise underpins both adaptive generation and state management designs.

invented entities (1)

wGraph (no independent evidence)
Purpose: unified graph representation of workflows.
New structure introduced to replace predefined templates.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: wGraph ... each node corresponds to an atomic operation ... task-specific workflows are dynamically instantiated ... GNN-based representation learning ... differential-based KV cache ... $KV(P, v) = KV_{\mathrm{base}}(v) + \Delta KV(P, v)$

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: topology-aware state management ... effective path pruning
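
The decomposition quoted in the first passage above, KV(P, v) = KVbase(v) + ΔKV(P, v), suggests storing one context-free base state per operation v and only a residual per execution prefix P. The sketch below illustrates that bookkeeping under loud assumptions: KV states are flattened into short Python lists, the residual is stored as a sparse dictionary, and the tolerance `tol` is a made-up parameter. It shows the arithmetic only and makes no claim about how the paper stores or compresses residuals.

```python
def sparse_delta(base, contextual, tol=1e-6):
    """Keep only the entries where contextual differs from base by > tol."""
    return {
        i: c - b
        for i, (b, c) in enumerate(zip(base, contextual))
        if abs(c - b) > tol
    }


def reconstruct(base, delta):
    """Rebuild the contextual state as base + delta, entry by entry."""
    return [b + delta.get(i, 0.0) for i, b in enumerate(base)]


# Toy flattened KV vectors; real KV states are per-layer, per-head tensors.
kv_base = [0.5, -1.0, 2.0, 0.0, 1.5, -0.25]
kv_ctx = [0.5, -1.0, 2.5, 0.0, 1.5, -0.75]

delta = sparse_delta(kv_base, kv_ctx)   # only 2 of 6 entries are stored
restored = reconstruct(kv_base, delta)  # equals kv_ctx in this toy case
```

If residuals are sparse in practice, as the Figure 9 caption on residual KV sparsity indicates, the per-prefix storage cost scales with the number of changed entries rather than with the full state size.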

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 14 internal anchors

[1] Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732.

[2] Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

[3] Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.

[4] Fan, S., Cong, X., Fu, Y., Zhang, Z., Zhang, S., Liu, Y., Wu, Y., Lin, Y., Liu, Z., and Sun, M. Workflowllm: Enhancing workflow orchestration capability of large language models. In International Conference on Learning Representations, volume 2025, pp. 24498–24525.

[5] Fourney, A., Bansal, G., Mozannar, H., Tan, C., Salinas, E., Niedtner, F., Proebsting, G., Bassman, G., Gerrits, J., Alber, J., et al. Magentic-one: A generalist multi-agent system for solving complex tasks. arXiv preprint arXiv:2411.04468.

[6] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[7] Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Wang, J., Zhang, C., Yau, S., Lin, Z., Zhou, L., et al. Metagpt: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations, volume 2024, pp. 23247–23275.

[8] Jang, E., Gu, S., and Poole, B. Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144.

[9] Josifoski, M., Klein, L., Peyrard, M., Baldwin, N., Li, Y., Geng, S., Schnitzler, J. P., Yao, Y., Wei, J., Paul, D., et al. Flows: Building blocks of reasoning and collaborating ai. arXiv preprint arXiv:2308.01285.

[10] Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., Santhanam, K., Haq, S., Sharma, A., Joshi, T., Moazam, H., Miller, H., et al. Dspy: compiling declarative language model calls into state-of-the-art pipelines. In International Conference on Learning Representations, volume 2024, pp. 54928–54958.

[11] Kipf, T. N. and Welling, M. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

[12] Li, Z., Xu, S., Mei, K., Hua, W., Rama, B., Raheja, O., Wang, H., Zhu, H., and Zhang, Y. Autoflow: Automated workflow generation for large language model agents. arXiv preprint arXiv:2407.12821.

[13] Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

[14] Packer, C., Wooders, S., Lin, K., Fang, V., Patil, S. G., Stoica, I., and Gonzalez, J. E. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560.

[15] Qiao, B., Li, L., Zhang, X., He, S., Kang, Y., Zhang, C., Yang, F., Dong, H., Zhang, J., Wang, L., et al. Taskweaver: A code-first agent framework. arXiv preprint arXiv:2311.17541.

[16] Qiao, S., Fang, R., Qiu, Z., Wang, X., Zhang, N., Jiang, Y., Xie, P., Huang, F., and Chen, H. Benchmarking agentic workflow generation. In International Conference on Learning Representations, volume 2025, pp. 69679–69703.

[17] Shang, Y., Li, Y., Zhao, K., Ma, L., Liu, J., Xu, F., and Li, Y. Agentsquare: Automatic llm agent search in modular design space. In International Conference on Learning Representations, volume 2025, pp. 3841–3865.

[18] Shen, J., Wadlom, N., Zhou, L., Wang, D., Miao, X., Fang, L., and Lu, Y. Flowmesh: A service fabric for composable llm workflows. arXiv preprint arXiv:2510.26913.

[19] Tang, X., Qin, T., Peng, T., Zhou, Z., Shao, D., Du, T., Wei, X., Xia, P., Wu, F., Zhu, H., et al. Agent kb: Leveraging cross-domain experience for agentic problem solving. arXiv preprint arXiv:2507.06229.

[20] Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.

[21] Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023a. Wang, J., Xu, H., Jia, H., Zhang, X., Yan, M., Shen, W., Zhang, J., Huang, F., and Sang, J. Mobile-agent-v2: Mobile device operation assistant with effec...

[22] Wang, W., Ma, Z., Wang, Z., Wu, C., Ji, J., Chen, W., Li, X., and Yuan, Y. A survey of llm-based agents in medicine: How far are we from baymax? Findings of the Association for Computational Linguistics: ACL 2025, pp. 10345–10359, 2025a. Wang, Z. Z., Mao, J., Fried, D., and Neubig, G. Agent workflow memory. In International Conference on Machine Learning...

[23] Wu, Y., Yue, T., Zhang, S., Wang, C., and Wu, Q. Stateflow: Enhancing llm task-solving through state-driven workflows. arXiv preprint arXiv:2403.11322.

[24] Xiao, R., Ma, W., Wang, K., Wu, Y., Zhao, J., Wang, H., Huang, F., and Li, Y. Flowbench: Revisiting and benchmarking workflow-guided planning for llm-based agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 10883–10900.

[25] Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

[26] Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., and Manning, C. D. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 conference on empirical methods in natural language processing, pp. 2369–2380.

[27] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023a. Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. R., and Cao, Y. React: Synergizing reasoning and acting in lang...

[28] Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., and Wang, Y.-X. Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406.

[29] Zhu, X., Chen, Y., Tian, H., Tao, C., Su, W., Yang, C., Huang, G., Li, B., Lu, L., Wang, X., et al. Ghost in the minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144.
[30] A. Experimental and Implementation Details. Data Preparation and Graph Construction. To construct the supervision dataset, we leverage GPT-4o to synthesize high-quality execution traces for queries in the training corpus. These traces are parsed to extract atomic operations and ...
[31] During training, edge selection is approximated as

$$\tilde{s}_{i,j} = \sigma\!\left(\frac{\omega_{ij} + g_{ij}}{\tau}\right), \qquad g_{ij} \sim \mathrm{Gumbel}(0, 1), \tag{12}$$

where $\tau$ is a temperature hyperparameter. This relaxation bridges the gap between the discrete graph topology and continuous gradient updates (Fu et al., 2026; Wang et al., 2026). Inference: Constrained Decoding. During inference, we bypass the stochastic relax...
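
The relaxed edge selection in the excerpt above, a sigmoid applied to an edge logit plus Gumbel(0, 1) noise and divided by a temperature τ, can be sketched in a few lines. This is a standard-library illustration under stated assumptions: the grouping of terms inside the sigmoid follows the usual Gumbel-based relaxation and is a reading of the garbled source, and the function names, the logit value 1.5, and τ = 0.5 are all hypothetical. The truncated inference-time procedure is not reproduced.

```python
import math
import random


def sample_gumbel(rng):
    """Draw g ~ Gumbel(0, 1) by inverse transform: g = -log(-log(u))."""
    u = max(rng.random(), 1e-12)  # clamp away from 0 so both logs are defined
    return -math.log(-math.log(u))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relaxed_edge_score(omega, tau, rng):
    """Soft edge selection: sigmoid((omega + g) / tau), g ~ Gumbel(0, 1)."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    return sigmoid((omega + sample_gumbel(rng)) / tau)


rng = random.Random(0)  # seeded for repeatability
scores = [relaxed_edge_score(1.5, tau=0.5, rng=rng) for _ in range(5)]
```

Lower temperatures push the scores toward 0 or 1, which approximates a discrete keep-or-drop decision per edge while remaining differentiable with respect to the logit.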

[1] [1]

Program Synthesis with Large Language Models

Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. D. O., Kaplan, J., Edwards, H., Burda, Y ., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V ., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Workflowllm: Enhanc- ing workflow orchestration capability of large language models

Fan, S., Cong, X., Fu, Y ., Zhang, Z., Zhang, S., Liu, Y ., Wu, Y ., Lin, Y ., Liu, Z., and Sun, M. Workflowllm: Enhanc- ing workflow orchestration capability of large language models. InInternational Conference on Learning Repre- sentations, volume 2025, pp. 24498–24525,

work page 2025

[5] [5]

Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks

Fourney, A., Bansal, G., Mozannar, H., Tan, C., Salinas, E., Niedtner, F., Proebsting, G., Bassman, G., Gerrits, J., Alber, J., et al. Magentic-one: A generalist multi- agent system for solving complex tasks.arXiv preprint arXiv:2411.04468,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Metagpt: Meta programming for a multi-agent collaborative frame- work

Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y ., Wang, J., Zhang, C., Yau, S., Lin, Z., Zhou, L., et al. Metagpt: Meta programming for a multi-agent collaborative frame- work. InInternational Conference on Learning Repre- sentations, volume 2024, pp. 23247–23275,

work page 2024

[8] [8]

Categorical Reparameterization with Gumbel-Softmax

Jang, E., Gu, S., and Poole, B. Categorical repa- rameterization with gumbel-softmax.arXiv preprint arXiv:1611.01144,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

P., Yao, Y ., Wei, J., Paul, D., et al

Josifoski, M., Klein, L., Peyrard, M., Baldwin, N., Li, Y ., Geng, S., Schnitzler, J. P., Yao, Y ., Wei, J., Paul, D., et al. Flows: Building blocks of reasoning and collaborating ai. arXiv preprint arXiv:2308.01285,

work page arXiv

[10] [10]

Dspy: compiling declarative language model calls into state-of-the-art pipelines

Khattab, O., Singhvi, A., Maheshwari, P., Zhang, Z., San- thanam, K., Haq, S., Sharma, A., Joshi, T., Moazam, H., Miller, H., et al. Dspy: compiling declarative language model calls into state-of-the-art pipelines. InInterna- tional Conference on Learning Representations, volume 2024, pp. 54928–54958,

work page 2024

[11] [11]

Kipf, T. N. and Welling, M. Semi-supervised classifica- tion with graph convolutional networks.arXiv preprint arXiv:1609.02907,

work page internal anchor Pith review Pith/arXiv arXiv

[12] [12]

Autoflow: Automated workflow generation for large language model agents

Li, Z., Xu, S., Mei, K., Hua, W., Rama, B., Raheja, O., Wang, H., Zhu, H., and Zhang, Y . Autoflow: Automated workflow generation for large language model agents. arXiv preprint arXiv:2407.12821,

work page arXiv

[13] [13]

Decoupled Weight Decay Regularization

Loshchilov, I. and Hutter, F. Decoupled weight decay regu- larization.arXiv preprint arXiv:1711.05101,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

MemGPT: Towards LLMs as Operating Systems

Packer, C., Wooders, S., Lin, K., Fang, V ., Patil, S. G., Stoica, I., and Gonzalez, J. E. Memgpt: Towards llms as operating systems.arXiv preprint arXiv:2310.08560,

work page internal anchor Pith review Pith/arXiv arXiv

[15] Qiao, B., Li, L., Zhang, X., He, S., Kang, Y., Zhang, C., Yang, F., Dong, H., Zhang, J., Wang, L., et al. Taskweaver: A code-first agent framework. arXiv preprint arXiv:2311.17541,

[16] Qiao, S., Fang, R., Qiu, Z., Wang, X., Zhang, N., Jiang, Y., Xie, P., Huang, F., and Chen, H. Benchmarking agentic workflow generation. In International Conference on Learning Representations, volume 2025, pp. 69679–69703, 2025.

[17] Shang, Y., Li, Y., Zhao, K., Ma, L., Liu, J., Xu, F., and Li, Y. Agentsquare: Automatic llm agent search in modular design space. In International Conference on Learning Representations, volume 2025, pp. 3841–3865, 2025.

[18] Shen, J., Wadlom, N., Zhou, L., Wang, D., Miao, X., Fang, L., and Lu, Y. Flowmesh: A service fabric for composable llm workflows. arXiv preprint arXiv:2510.26913,

[19] Tang, X., Qin, T., Peng, T., Zhou, Z., Shao, D., Du, T., Wei, X., Xia, P., Wu, F., Zhu, H., et al. Agent kb: Leveraging cross-domain experience for agentic problem solving. arXiv preprint arXiv:2507.06229,

[20] Team, G., Riviere, M., Pathak, S., Sessa, P. G., Hardin, C., Bhupatiraju, S., Hussenot, L., Mesnard, T., Shahriari, B., Ramé, A., et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118,

[21] Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023a.

Wang, J., Xu, H., Jia, H., Zhang, X., Yan, M., Shen, W., Zhang, J., Huang, F., and Sang, J. Mobile-agent-v2: Mobile device operation assistant with effec...

[22] Wang, W., Ma, Z., Wang, Z., Wu, C., Ji, J., Chen, W., Li, X., and Yuan, Y. A survey of llm-based agents in medicine: How far are we from baymax? Findings of the Association for Computational Linguistics: ACL 2025, pp. 10345–10359, 2025a.

Wang, Z. Z., Mao, J., Fried, D., and Neubig, G. Agent workflow memory. In International Conference on Machine Learning...

[23] Wu, Y., Yue, T., Zhang, S., Wang, C., and Wu, Q. Stateflow: Enhancing llm task-solving through state-driven workflows. arXiv preprint arXiv:2403.11322,

[24] Xiao, R., Ma, W., Wang, K., Wu, Y., Zhao, J., Wang, H., Huang, F., and Li, Y. Flowbench: Revisiting and benchmarking workflow-guided planning for llm-based agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 10883–10900, 2024.

[25] Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115,

[26] Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., and Manning, C. D. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 conference on empirical methods in natural language processing, pp. 2369–2380, 2018.

[27] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Advances in neural information processing systems, 36:11809–11822, 2023a.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. R., and Cao, Y. React: Synergizing reasoning and acting in lang...

[28] Zhou, A., Yan, K., Shlapentokh-Rothman, M., Wang, H., and Wang, Y.-X. Language agent tree search unifies reasoning acting and planning in language models. arXiv preprint arXiv:2310.04406,

[29] Zhu, X., Chen, Y., Tian, H., Tao, C., Su, W., Yang, C., Huang, G., Li, B., Lu, L., Wang, X., et al. Ghost in the minecraft: Generally capable agents for open-world environments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144,

A. Experimental and Implementation Details

Data Preparation and Graph Construction. To construct the supervision dataset, we leverage GPT-4o to synthesize high-quality execution traces for queries in the training corpus. These traces are parsed to extract atomic operations and ...

during training. Edge selection is approximated as:

$$\tilde{s}_{i,j} = \sigma\!\left(\frac{\omega_{ij} + g_{ij}}{\tau}\right), \quad g_{ij} \sim \mathrm{Gumbel}(0, 1), \tag{12}$$

where $\tau$ is a temperature hyperparameter. This relaxation bridges the gap between the discrete graph topology and continuous gradient updates (Fu et al., 2026; Wang et al., 2026).

Inference: Constrained Decoding. During inference, we bypass the stochastic relax...
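As a rough illustration of the relaxation in Eq. (12), the following pure-Python sketch draws Gumbel(0, 1) noise, adds it to each edge logit, and passes the sum through a temperature-scaled sigmoid. It is not the authors' implementation: the function names (`sample_gumbel`, `relaxed_edge_scores`), the example logits `omega`, and the chosen temperatures are all hypothetical, and a real training loop would presumably use an autodiff framework so that gradients flow through the scores.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one Gumbel(0, 1) sample by inverse transform: g = -log(-log(u))."""
    u = rng.random()
    # Clamp u away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relaxed_edge_scores(omega, tau, rng):
    """Soft edge scores following Eq. (12): sigmoid((omega_ij + g_ij) / tau).

    omega: nested list of edge logits (hypothetical values in this sketch).
    tau:   temperature; smaller values push scores toward 0 or 1.
    """
    return [
        [sigmoid((w + sample_gumbel(rng)) / tau) for w in row]
        for row in omega
    ]


# Hypothetical 2x2 logit matrix, used only for illustration.
omega = [[2.0, -3.0], [0.5, -0.5]]

# Reusing the same seed gives both calls identical noise, which isolates
# the effect of the temperature on the resulting scores.
soft = relaxed_edge_scores(omega, tau=1.0, rng=random.Random(0))
sharp = relaxed_edge_scores(omega, tau=0.05, rng=random.Random(0))
```

With the noise held fixed, lowering the temperature moves every score further from 0.5, which is the sense in which the relaxation approaches a discrete edge choice while remaining differentiable at any positive temperature.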