Composing Verifiable Conceptual Models via Building Blocks: Towards Design-Time Verification of Agentic AI Workflows
Pith reviewed 2026-06-26 14:08 UTC · model grok-4.3
The pith
Agentic AI workflows can be checked for design flaws at composition time using twelve structural rules on reusable building blocks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Modeling agentic workflows as compositions of reusable building blocks and checking their compatibility through twelve structural rules produces a verifier that reliably detects design flaws even when those flaws have been obscured by structural transformations such as splitting tasks between agents.
What carries the argument
Twelve structural rules that encode compatibility constraints when building blocks are assembled into agentic workflows.
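To make the idea of a structural compatibility rule concrete, here is a minimal sketch of one such check. It is not taken from the paper: the `Block` type, the `consumes`/`produces` fields, and the interface-mismatch rule itself are assumptions chosen for illustration, and the paper's twelve rules may be formulated quite differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical building block with a declared input and output type."""
    name: str
    consumes: str  # type of data the block expects
    produces: str  # type of data the block emits


def interface_mismatches(blocks, edges):
    """Return edges whose source output type differs from the target input type."""
    by_name = {block.name: block for block in blocks}
    return [
        (src, dst)
        for src, dst in edges
        if by_name[src].produces != by_name[dst].consumes
    ]


# Illustrative workflow: the emailer expects an approved summary,
# but the summarizer is wired to it directly.
blocks = [
    Block("retriever", consumes="query", produces="documents"),
    Block("summarizer", consumes="documents", produces="summary"),
    Block("emailer", consumes="approved_summary", produces="receipt"),
]
edges = [("retriever", "summarizer"), ("summarizer", "emailer")]

violations = interface_mismatches(blocks, edges)
print(violations)
```

Because the check only inspects the declared graph, it can run before any agent is instantiated, which is the sense of "design-time" used on this page.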
If this is right
- The verifier can be packaged as a software prototype that operates before any agents are executed.
- Detection remains effective after common structural edits that preserve the underlying workflow logic.
- Community repositories of verified building blocks could be assembled into safe workflows using the same rules.
- The approach mirrors design-time verification practices already used for conceptual models in modeling and simulation.
Where Pith is reading between the lines
- The same rule set might be extended to new classes of building blocks without rewriting the entire verifier.
- Integration with existing workflow editors could surface violations while a designer is still dragging blocks together.
- If the rules prove incomplete, the evaluation method using transformed variants offers a way to test any proposed additions.
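The transformed-variant evaluation can be pictured as an invariance test: apply a logic-preserving rewrite and confirm that the verifier reports the same violation. The sketch below is an assumption-laden illustration, not the paper's prototype. The node kinds (`untrusted_input`, `guard`, `external_action`), the reachability rule, and the `split_node` rewrite are all hypothetical.

```python
def unguarded_actions(edges, kinds):
    """Return (source, action) pairs where an external action is reachable
    from an untrusted input without passing through a guard block."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
    flagged = set()
    sources = [name for name, kind in kinds.items() if kind == "untrusted_input"]
    for source in sources:
        stack, seen = [source], {source}
        while stack:
            node = stack.pop()
            for nxt in successors.get(node, []):
                if nxt in seen or kinds[nxt] == "guard":
                    continue  # guards stop the propagation of untrusted data
                seen.add(nxt)
                if kinds[nxt] == "external_action":
                    flagged.add((source, nxt))
                stack.append(nxt)
    return flagged


def split_node(edges, kinds, node, first, second):
    """Rewrite `node` into the chain first -> second, keeping all other edges."""
    new_edges = [(first, second)]
    for src, dst in edges:
        new_src = second if src == node else src   # outgoing edges leave `second`
        new_dst = first if dst == node else dst    # incoming edges enter `first`
        new_edges.append((new_src, new_dst))
    new_kinds = {name: kind for name, kind in kinds.items() if name != node}
    new_kinds[first] = new_kinds[second] = kinds[node]
    return new_edges, new_kinds


# Illustrative flawed workflow: web content reaches an email action unguarded.
kinds = {"web_page": "untrusted_input", "planner": "agent", "send_email": "external_action"}
edges = [("web_page", "planner"), ("planner", "send_email")]
base = unguarded_actions(edges, kinds)

# Variant: the planner's task is split between two agents.
split_edges, split_kinds = split_node(edges, kinds, "planner", "planner_a", "planner_b")
variant = unguarded_actions(split_edges, split_kinds)

print(base == variant)
```

A proposed additional rule could be tested the same way: if its verdict changes under a rewrite that preserves workflow logic, the rule is sensitive to graph shape rather than to the underlying flaw.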
Load-bearing premise
The twelve structural rules capture every relevant compatibility issue that can arise when composing agentic workflows from the chosen building blocks.
What would settle it
A collection of agentic workflows in which a compatibility violation exists that is not caught by any of the twelve rules.
Original abstract
Agentic AI systems orchestrate multiple LLM-based agents through workflow architectures that coordinate decisions, tools, and external actions. While current platforms emphasize runtime safeguards, little support exists for verifying workflows during system design. From a Modeling & Simulation perspective, this gap is analogous to composing conceptual models without verifying whether their building blocks interact coherently. We propose a design-time verification approach that models agentic workflows as compositions of reusable building blocks and checks their compatibility through twelve structural rules. We implemented these rules in a software prototype and evaluated them using two openly released datasets: 48 workflows with known design flaws and 168 variants that preserve workflow logic but alter graph structure. Results show that our verifier reliably detects violations even when flawed designs are obscured through structural transformations such as splitting tasks between agents. Future works could combine our verification with community repositories of building blocks to compose safe agentic workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a design-time verification approach for agentic AI workflows by modeling them as compositions of reusable building blocks checked for compatibility via twelve structural rules. A prototype implements the rules and is evaluated on two open datasets consisting of 48 workflows with known design flaws and 168 logic-preserving structural variants; the central claim is that the verifier reliably detects violations even after transformations such as splitting tasks between agents.
Significance. If the rule set is shown to be independently justified and the evaluation free of circularity, the work could provide a practical contribution to early detection of compatibility issues in agentic systems, extending ideas from conceptual model verification in modeling and simulation. The release of open datasets and explicit testing against graph rewrites are strengths that support reproducibility and robustness claims.
major comments (3)
- [Abstract and Evaluation section] The claim that the verifier 'reliably detects violations' rests on 48 workflows labeled with 'known design flaws.' The manuscript provides no description of how these flaws were identified or whether identification was performed independently of the twelve proposed rules. If the base cases were constructed or labeled precisely by violation of those rules, successful detection is expected by construction and supplies no independent evidence that the rules catch real compatibility problems.
- [Section describing the twelve structural rules (likely §3)] No derivation, formal justification, completeness argument, or comparison to alternative rule sets is supplied for the twelve rules. The central claim that the rules detect 'all relevant compatibility issues' when composing workflows therefore rests on an unevaluated assumption rather than demonstrated soundness.
- [Evaluation section] No error analysis, false-positive/negative discussion, or external ground truth (e.g., runtime execution logs or independent expert labeling) is reported. The 168-variant test demonstrates robustness to structural rewrites but inherits any circularity present in the 48 base cases.
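The error analysis requested in the last major comment could take the form of a confusion matrix between the verifier's verdicts and labels assigned independently of the rules. The sketch below shows the arithmetic only. The function names are hypothetical, and the two label lists are placeholder values invented for illustration, not results from the paper or its datasets.

```python
def confusion_counts(flagged, flawed):
    """Count agreement between verifier verdicts and independent labels."""
    if len(flagged) != len(flawed):
        raise ValueError("verdicts and labels must cover the same workflows")
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for was_flagged, is_flawed in zip(flagged, flawed):
        if was_flagged and is_flawed:
            counts["tp"] += 1   # real flaw, detected
        elif was_flagged:
            counts["fp"] += 1   # valid workflow, wrongly flagged
        elif is_flawed:
            counts["fn"] += 1   # real flaw, missed
        else:
            counts["tn"] += 1   # valid workflow, correctly passed
    return counts


def precision_recall(counts):
    """Return (precision, recall); None where the denominator is zero."""
    flagged_total = counts["tp"] + counts["fp"]
    flawed_total = counts["tp"] + counts["fn"]
    precision = counts["tp"] / flagged_total if flagged_total else None
    recall = counts["tp"] / flawed_total if flawed_total else None
    return precision, recall


# Placeholder data: five workflows, verifier verdicts vs. independent labels.
verifier_flagged = [True, True, False, True, False]
independently_flawed = [True, False, False, True, True]

counts = confusion_counts(verifier_flagged, independently_flawed)
precision, recall = precision_recall(counts)
print(counts, precision, recall)
```

The point of the exercise is that `independently_flawed` must come from a source other than the twelve rules, such as expert review or runtime logs. Otherwise the false-positive and false-negative counts are zero by construction.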
minor comments (1)
- [Abstract] The abstract refers to future combination with 'community repositories of building blocks' without citing existing repositories or related work on workflow component libraries.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the independence of the evaluation and the justification of the rule set. We address each major comment below and indicate planned revisions to improve clarity and transparency.
- Referee: [Abstract and Evaluation section] The claim that the verifier 'reliably detects violations' rests on 48 workflows labeled with 'known design flaws.' The manuscript provides no description of how these flaws were identified or whether identification was performed independently of the twelve proposed rules. If the base cases were constructed or labeled precisely by violation of those rules, successful detection is expected by construction and supplies no independent evidence that the rules catch real compatibility problems.
  Authors: The manuscript does not describe the flaw identification process for the 48 workflows. The base cases were assembled from representative agentic workflow patterns, with flaws corresponding to violations of the structural rules. We acknowledge that this creates dependence on the rule set and limits claims of fully independent validation. In the revised manuscript we will add an explicit description of dataset construction in the Evaluation section, including the source patterns used and the manual labeling criteria, and will qualify the strength of the evidence accordingly.
  Revision: yes
- Referee: [Section describing the twelve structural rules (likely §3)] No derivation, formal justification, completeness argument, or comparison to alternative rule sets is supplied for the twelve rules. The central claim that the rules detect 'all relevant compatibility issues' when composing workflows therefore rests on an unevaluated assumption rather than demonstrated soundness.
  Authors: The manuscript supplies no derivation, formal justification, or completeness argument for the twelve rules. The rules were obtained by enumerating recurring compatibility failures observed when composing agentic building blocks (interface mismatches, dependency cycles, resource conflicts, etc.). We do not claim the set is complete or exhaustive. In revision we will insert a dedicated subsection in §3 that states the origin of each rule, provides a short rationale drawn from conceptual modeling principles, and explicitly notes the absence of a completeness proof or comparison against alternative rule sets.
  Revision: yes
- Referee: [Evaluation section] No error analysis, false-positive/negative discussion, or external ground truth (e.g., runtime execution logs or independent expert labeling) is reported. The 168-variant test demonstrates robustness to structural rewrites but inherits any circularity present in the 48 base cases.
  Authors: The current Evaluation section contains no error analysis, false-positive/negative rates, or external ground truth. The 168-variant experiment tests invariance of the rules under logic-preserving graph rewrites and is therefore independent of the base-case labeling, yet the overall detection claim still rests on the 48 workflows. We will expand the section with a brief discussion of possible false positives (overly strict rules on valid but unconventional compositions) and will add a limitations paragraph noting the lack of runtime logs or independent expert labels. These additions will be marked as directions for future validation.
  Revision: partial
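The second response lists dependency cycles among the recurring compatibility failures. As a rough illustration of how such a failure can be caught purely from the workflow graph, the sketch below uses Python's standard `graphlib` module (Python 3.9+). The function name and the example workflow are hypothetical, and nothing here is claimed to match how the paper's prototype implements its rules.

```python
from graphlib import CycleError, TopologicalSorter


def find_dependency_cycle(depends_on):
    """Return a list of blocks forming a cycle, or None if the graph is acyclic.

    `depends_on` maps each block name to the set of blocks it waits on.
    """
    try:
        # prepare() raises CycleError when no topological order exists.
        TopologicalSorter(depends_on).prepare()
    except CycleError as err:
        # The detected cycle is the second element of the exception's args;
        # its first and last entries are the same node.
        return err.args[1]
    return None


# Illustrative workflow: planner and critic wait on each other.
cyclic = {
    "planner": {"critic"},
    "critic": {"planner"},
    "reporter": {"critic"},
}
acyclic = {
    "planner": set(),
    "critic": {"planner"},
    "reporter": {"critic"},
}

cycle = find_dependency_cycle(cyclic)
print(cycle)
print(find_dependency_cycle(acyclic))
```

Some agentic designs include intentional feedback loops, so a rule of this kind would presumably need a way to distinguish permitted iteration from genuine deadlock. That is one of the false-positive scenarios the third response alludes to.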
Circularity Check
No circularity; evaluation uses external datasets
The paper proposes twelve structural rules for design-time verification of agentic workflows, implements them in a prototype, and evaluates detection on two openly released datasets (48 workflows with known design flaws plus 168 logic-preserving variants). No equations, fitted parameters, self-citations, or uniqueness theorems appear in the derivation. The central claim of reliable detection does not reduce to quantities or labels defined by the authors' own prior work or by construction from the rules themselves; the datasets are presented as external. This is the most common honest non-finding for a rule-based checker.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: twelve structural rules suffice to detect incompatibilities in compositions of agentic workflow building blocks.
invented entities (1)
- Reusable building blocks for agentic workflows (no independent evidence)