MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing

Cheng Yang; Chen Qian; Chuan Shi; Jinxuan Cai; Qi Meng; Xin Li; Yang Liu; Yishen Li; Zedi Liu

arxiv: 2603.06007 · v2 · pith:QXZTSZ46new · submitted 2026-03-06 · 💻 cs.CL · cs.AI · cs.MA


Pith reviewed 2026-05-21 12:38 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.MA

keywords multi-agent systems · large language models · graph-centric orchestration · Vibe Graphing · natural language to workflow · workflow specification · human-in-the-loop · agent collaboration


The pith

MASFactory compiles natural-language intent into editable workflow specifications that become executable graphs for multi-agent LLM systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MASFactory, a graph-centric framework for building and running LLM-based multi-agent systems. It centers on Vibe Graphing, a process that takes a user's natural language description of the desired behavior, turns it into a workflow specification that can be inspected and changed, and then converts that specification into a running directed graph where agents or sub-workflows sit at nodes and messages travel along edges. A sympathetic reader would care because existing frameworks force programmers to write substantial custom code for each new workflow, limit component reuse, and make it awkward to bring in outside data sources. By adding reusable components, skill libraries, multimodal message support, pluggable context hooks, and a visualizer for preview and live tracing, the approach aims to reduce manual effort while keeping human oversight in the loop.

Core claim

MASFactory is a graph-centric framework for orchestrating LLM-based multi-agent systems. It introduces Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components, skill support, multimodal message handling, and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human-in-the-loop interaction. Evaluation on seven public benchmarks confirms both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing.

What carries the argument

Vibe Graphing, the human-in-the-loop compilation step that converts natural-language intent first into an editable workflow specification and then into an executable directed computation graph.
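
To make the two-stage pipeline concrete, here is a minimal Python sketch of the middle step: an editable specification that a human adjusts before it is compiled into an execution order. It is not MASFactory's API. `WorkflowSpec`, `compile_spec`, and the role names are hypothetical, the LLM drafting step is replaced by a hand-written draft, and the sketch rejects cycles even though the paper names a Loop component.

```python
from dataclasses import dataclass, field
from graphlib import CycleError, TopologicalSorter


@dataclass
class WorkflowSpec:
    """Hypothetical editable specification: named roles plus directed edges."""
    roles: dict[str, str] = field(default_factory=dict)          # node name -> role prompt
    edges: list[tuple[str, str]] = field(default_factory=list)   # (source, target)


def compile_spec(spec: WorkflowSpec) -> list[str]:
    """Validate the spec and return one valid execution order of its nodes."""
    for src, dst in spec.edges:
        for name in (src, dst):
            if name not in spec.roles:
                raise ValueError(f"edge references unknown node: {name!r}")
    # TopologicalSorter expects each node mapped to the set of its predecessors.
    predecessors: dict[str, set[str]] = {name: set() for name in spec.roles}
    for src, dst in spec.edges:
        predecessors[dst].add(src)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError as exc:
        raise ValueError("spec contains a cycle") from exc


# Stand-in for what an LLM might draft from "plan the task, then write code".
draft = WorkflowSpec(
    roles={"planner": "Break the task into steps.", "coder": "Write the code."},
    edges=[("planner", "coder")],
)

# Human-in-the-loop edit before compilation: add a reviewer after the coder.
draft.roles["reviewer"] = "Review the code."
draft.edges.append(("coder", "reviewer"))

order = compile_spec(draft)
print(order)  # ['planner', 'coder', 'reviewer']
```

Keeping the specification as plain data is what makes the edit step cheap in this sketch: the human changes a dictionary and a list, and validation happens only at compile time.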

If this is right

Complex multi-agent workflows become expressible through natural language rather than direct code, lowering the effort needed to prototype new agent teams.
Reusable components and skill libraries allow the same agent behaviors to be dropped into multiple different workflows without reimplementation.
Pluggable context sources and multimodal handling let external data of varying types be wired into the graph at runtime.
The visualizer enables inspection of graph topology before execution and tracing of message flow during runs, supporting debugging and adjustment.
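
One way to picture pluggable context sources and runtime tracing together is the Python sketch below. Everything in it is an assumption made for illustration: `ContextSource`, `StaticNotes`, and `TracedAgent` are hypothetical names, the agent makes no LLM call, and the trace is a plain list rather than anything the paper's visualizer consumes.

```python
from typing import Protocol


class ContextSource(Protocol):
    """Hypothetical interface: anything that can return context for a query."""

    def fetch(self, query: str) -> list[str]: ...


class StaticNotes:
    """Toy source backed by an in-memory list of notes."""

    def __init__(self, notes: list[str]) -> None:
        self.notes = notes

    def fetch(self, query: str) -> list[str]:
        # Return notes that share at least one whole word with the query.
        words = set(query.lower().split())
        return [n for n in self.notes if words & set(n.lower().split())]


class TracedAgent:
    """Stub agent that gathers context from every source and logs each call."""

    def __init__(self, name: str, sources: list[ContextSource]) -> None:
        self.name = name
        self.sources = sources
        # Each trace entry is (agent name, input message, number of context items).
        self.trace: list[tuple[str, str, int]] = []

    def run(self, message: str) -> str:
        context = [item for s in self.sources for item in s.fetch(message)]
        self.trace.append((self.name, message, len(context)))
        # No LLM call here: the reply only reports what was retrieved.
        return f"{self.name} saw {len(context)} context item(s) for: {message}"


notes = StaticNotes(["Graphs have nodes", "Edges carry messages"])
agent = TracedAgent("researcher", [notes])
reply = agent.run("what do edges carry")
print(reply)        # researcher saw 1 context item(s) for: what do edges carry
print(agent.trace)  # [('researcher', 'what do edges carry', 1)]
```

Because `ContextSource` is a structural protocol, any object with a matching `fetch` method can be swapped in without touching the agent, which is the property the "pluggable" claim depends on.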

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same compilation-plus-editing pattern could be applied to other graph-based orchestration tasks outside multi-agent systems, such as business process workflows.
Because the specification remains editable after generation, the method may reduce the risk that an LLM misinterprets intent and produces an unrecoverable workflow.
Over time the collection of reusable components could grow into a shared library that accelerates development across research groups.

Load-bearing premise

Natural-language descriptions of user goals can be compiled into workflow specifications that remain accurate and require only modest human edits rather than extensive rewriting.

What would settle it

If Vibe Graphing applied to the seven benchmarks produces specifications that need frequent large-scale corrections to match the performance of hand-written MAS methods, or if the resulting graphs fail to reproduce published results, the central claim would be undermined.

Original abstract

Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents or sub-workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph-centric framework for orchestrating LLM-based MAS. It introduces Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components, skill support, multimodal message handling, and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human-in-the-loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (https://github.com/BUPT-GAMMA/MASFactory, licensed under Apache-2.0) and video demonstration (https://youtu.be/ANynzVfY32k) are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

MASFactory adds a natural-language front end to graph-based multi-agent orchestration, but the evaluation of that step stays thin on numbers.


MASFactory gives you a way to describe multi-agent workflows in plain language and get a graph out of it, with editing in between, plus some supporting features for reuse and visualization. What stands out as new is the Vibe Graphing method. It takes natural-language intent, compiles it to an editable workflow specification, and then to an executable graph. The framework also adds reusable components, skill support, multimodal message handling, pluggable context integration, and a visualizer for topology and runtime tracing. This targets the manual effort and limited reuse in existing MAS frameworks.

The paper does well by releasing the code on GitHub under Apache 2.0 and including a video demo. It also evaluates reproduction consistency on seven public benchmarks for some representative MAS methods, which shows the orchestration works as expected for those cases.

The main soft spot is the lack of quantitative detail on Vibe Graphing. The abstract says they validated its effectiveness, but there are no reported numbers on compilation accuracy, average human edits required, or how well the output preserves the original intent. Without those, it's hard to gauge if this really reduces effort or just moves it to post-editing. The stress-test note correctly flags this gap.

This work is for developers and researchers building or experimenting with LLM-based multi-agent systems. Anyone looking for a structured, visual way to orchestrate agents with less boilerplate would find the framework and its components useful. It deserves a serious referee. The implementation is concrete and open, and the approach addresses a practical problem even if the evaluation of the novel part could be stronger. I would recommend sending it to peer review.

Referee Report

1 major / 2 minor

Summary. The paper introduces MASFactory, a graph-centric framework for orchestrating LLM-based multi-agent systems. It models MAS workflows as directed computation graphs and proposes Vibe Graphing, a human-in-the-loop method that compiles natural-language intent into an editable workflow specification and then an executable graph. The framework adds reusable components, skill support, multimodal message handling, pluggable context integration, and a visualizer for topology, tracing, and interaction. Evaluation is performed on seven public benchmarks to demonstrate reproduction consistency for existing MAS methods and the effectiveness of Vibe Graphing, with code and a video demo released under Apache-2.0.

Significance. If Vibe Graphing reliably converts natural-language intent into accurate, low-edit workflow graphs while preserving user goals, the framework could meaningfully lower the manual overhead of building complex MAS workflows, improve component reuse, and ease integration of heterogeneous context sources. The open-source release and visualizer support practical adoption and extension by the community.

major comments (1)

[Evaluation section; abstract] The central claim that Vibe Graphing provides an effective human-in-the-loop compilation of natural-language intent into editable workflow specifications rests on validation across seven benchmarks, yet no quantitative metrics are reported for compilation accuracy, success rate without post-compilation intervention, average number of human edits required, or graph fidelity to the original intent. This leaves the reliability of the NL-to-spec step as an untested assumption that directly affects the practical significance of the framework.
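
Two of the missing metrics are cheap to compute once the generated and human-edited specifications are both saved. The Python sketch below is one possible formulation, not something the paper reports: the spec schema (a dict with "nodes" and "edges"), the function names, and the two sample pairs are all assumptions, and the numbers it prints are toy values rather than results.

```python
def spec_edit_count(generated: dict, edited: dict) -> int:
    """Count node and edge additions or removals between two specs.

    Each spec is assumed to be a dict with "nodes" (names) and
    "edges" ((source, target) pairs); this schema is hypothetical.
    """
    node_diff = set(generated["nodes"]) ^ set(edited["nodes"])
    edge_diff = set(map(tuple, generated["edges"])) ^ set(map(tuple, edited["edges"]))
    return len(node_diff) + len(edge_diff)


def summarize(pairs: list[tuple[dict, dict]]) -> dict[str, float]:
    """Aggregate edit counts over (generated, edited) spec pairs."""
    counts = [spec_edit_count(g, e) for g, e in pairs]
    return {
        # Share of intents whose generated spec was accepted unchanged.
        "zero_edit_rate": sum(c == 0 for c in counts) / len(counts),
        # Average number of structural edits per intent.
        "mean_edits": sum(counts) / len(counts),
    }


# Toy data for illustration only: one spec accepted as-is, one extended by a human.
pairs = [
    (
        {"nodes": ["a", "b"], "edges": [("a", "b")]},
        {"nodes": ["a", "b"], "edges": [("a", "b")]},
    ),
    (
        {"nodes": ["a", "b"], "edges": [("a", "b")]},
        {"nodes": ["a", "b", "c"], "edges": [("a", "b"), ("b", "c")]},
    ),
]

report = summarize(pairs)
print(report)  # {'zero_edit_rate': 0.5, 'mean_edits': 1.0}
```

This counts structural edits only. Changes to prompts or role descriptions inside a node would need a separate text-level measure, and fidelity to the original intent still requires human or ground-truth judgment.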

minor comments (2)

[Related Work] The related-work discussion would benefit from explicit comparison of Vibe Graphing to prior graph-based MAS orchestration tools (e.g., AutoGen, LangGraph) on dimensions of editability and NL compilation.
[Framework Overview] Figure captions for the visualizer and workflow examples should include concrete examples of input NL intent, generated spec, and final graph to illustrate the compilation pipeline.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address the major comment regarding the evaluation of Vibe Graphing below, providing a point-by-point response while proposing concrete revisions to strengthen the paper.


Referee: [Evaluation section; abstract] The central claim that Vibe Graphing provides an effective human-in-the-loop compilation of natural-language intent into editable workflow specifications rests on validation across seven benchmarks, yet no quantitative metrics are reported for compilation accuracy, success rate without post-compilation intervention, average number of human edits required, or graph fidelity to the original intent. This leaves the reliability of the NL-to-spec step as an untested assumption that directly affects the practical significance of the framework.

Authors: We acknowledge that the evaluation section primarily reports reproduction consistency across the seven public benchmarks to validate that MASFactory can faithfully implement representative existing MAS methods. The effectiveness of Vibe Graphing is currently supported through its methodological description, the open-source implementation, the visualizer for human-in-the-loop interaction, and the accompanying video demonstration, which collectively illustrate the NL-to-spec compilation process. However, we agree that the manuscript does not provide quantitative metrics specifically measuring compilation accuracy, success rate without post-compilation edits, average number of human edits, or graph fidelity to original intent. This constitutes a genuine gap in directly substantiating the reliability and practical utility of the human-in-the-loop step. In the revised manuscript, we will add a dedicated subsection to the Evaluation section that reports results from additional user studies. These will include quantitative measures such as compilation accuracy against expert-defined ground-truth workflows, the percentage of cases succeeding without intervention, average edit counts per intent, and fidelity scores based on semantic similarity and user ratings. We will also discuss limitations of the current human-in-the-loop design.

revision: yes

Circularity Check

0 steps flagged

No circularity: implementation framework with benchmark validation


The paper introduces MASFactory as a software framework for LLM-based multi-agent systems, describing Vibe Graphing as a human-in-the-loop compilation process from natural language to workflow graphs, along with reusable components and evaluation on seven public benchmarks for reproduction consistency and effectiveness. No equations, fitted parameters, predictions, or derivations appear in the abstract or described structure that reduce to self-definitions or self-citations by construction. The contribution is an engineering artifact with external benchmark validation, making the central claims independent of any internal loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper rests on the domain assumption that MAS workflows are naturally modeled as directed computation graphs and introduces Vibe Graphing as a new method without additional free parameters or invented physical entities.

axioms (1)

domain assumption · MAS workflows can be naturally modeled as directed computation graphs where nodes execute agents or sub-workflows and edges encode dependencies and message passing.
Stated directly in the abstract as the modeling basis for the framework.
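
The modeling assumption can be illustrated with a small Python sketch in which each node is a callable, each edge delivers the source node's output to the target node's inbox, and nodes run in dependency order. This is a generic directed-acyclic-graph executor written for illustration, not MASFactory's runtime. `run_graph` and the three toy nodes are hypothetical, and real agents would call an LLM instead of a lambda.

```python
from graphlib import TopologicalSorter
from typing import Callable

# A node receives the messages on its incoming edges and returns one message.
Agent = Callable[[list[str]], str]


def run_graph(
    nodes: dict[str, Agent],
    edges: list[tuple[str, str]],
    initial: str,
) -> dict[str, str]:
    """Run every node once in dependency order and return each node's output."""
    preds: dict[str, list[str]] = {name: [] for name in nodes}
    for src, dst in edges:
        preds[dst].append(src)

    outputs: dict[str, str] = {}
    order = TopologicalSorter({n: set(p) for n, p in preds.items()}).static_order()
    for name in order:
        # Source nodes (no incoming edges) receive the initial user message.
        inbox = [outputs[p] for p in preds[name]] or [initial]
        outputs[name] = nodes[name](inbox)
    return outputs


nodes: dict[str, Agent] = {
    "draft": lambda inbox: inbox[0].upper(),
    "count": lambda inbox: str(len(inbox[0])),
    "merge": lambda inbox: " | ".join(inbox),
}
edges = [("draft", "merge"), ("count", "merge")]

result = run_graph(nodes, edges, "hello")
print(result["merge"])  # HELLO | 5
```

The inbox order for `merge` follows the order in which its incoming edges were declared, so the output is deterministic even though `draft` and `count` may run in either order.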

invented entities (1)

Vibe Graphing · no independent evidence
purpose: Human-in-the-loop compilation of natural-language intent into an editable workflow specification and executable graph.
Newly introduced approach in the paper to address manual effort and reuse limitations.

pith-pipeline@v0.9.0 · 5788 in / 1290 out tokens · 73244 ms · 2026-05-21T12:38:08.120977+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph... three core tasks: Role Assignment, Structure Design, Semantic Completion

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Graph and Loop... NodeTemplate... ComposedGraph... reusable components... pluggable Context Adapter

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Security Considerations for Multi-agent Systems
cs.CR · 2026-03 · unverdicted · novelty 6.0

No existing AI security framework covers a majority of the 193 identified multi-agent system threats in any category, with OWASP Agentic Security Initiative achieving the highest overall coverage at 65.3%.