Composing Verifiable Conceptual Models via Building Blocks: Towards Design-Time Verification of Agentic AI Workflows

Alexander C. Nwala; Noe Y. Flandre; Philippe J. Giabbanelli

arxiv: 2606.21565 · v1 · pith:DKEMIHIDnew · submitted 2026-06-19 · 💻 cs.AI

Pith reviewed 2026-06-26 14:08 UTC · model grok-4.3

classification 💻 cs.AI

keywords agentic AI · workflow verification · design-time verification · building blocks · LLM agents · structural rules · conceptual models · compatibility checking

The pith

Agentic AI workflows can be checked for design flaws at composition time using twelve structural rules on reusable building blocks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models agentic AI systems as compositions of reusable building blocks whose interactions are governed by twelve structural rules. It implements these rules in a prototype verifier and tests it on 48 workflows containing known flaws plus 168 variants that hide the same flaws through graph changes such as task splitting. The results indicate that the rules continue to flag violations after such transformations. A sympathetic reader would see this as supplying a missing design-time check that current runtime-focused platforms lack, analogous to verifying conceptual models before simulation.

Core claim

Modeling agentic workflows as compositions of reusable building blocks and checking their compatibility through twelve structural rules produces a verifier that reliably detects design flaws even when those flaws have been obscured by structural transformations such as splitting tasks between agents.

What carries the argument

Twelve structural rules that encode compatibility constraints when building blocks are assembled into agentic workflows.
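
To make the idea of rule-based compatibility checking concrete, here is a minimal sketch. It is not the paper's verifier, and the two rules shown (interface matching and acyclicity) are illustrative assumptions rather than members of the paper's twelve rules, which this page does not enumerate. The names `Block`, `Workflow`, `RULES`, and `verify` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Block:
    """Hypothetical building block: a named step with typed inputs and outputs."""
    name: str
    inputs: frozenset = frozenset()
    outputs: frozenset = frozenset()


@dataclass
class Workflow:
    """Hypothetical workflow: blocks keyed by name plus directed edges."""
    blocks: dict = field(default_factory=dict)  # name -> Block
    edges: list = field(default_factory=list)   # (source name, target name)

    def add(self, block):
        self.blocks[block.name] = block
        return self

    def connect(self, src, dst):
        self.edges.append((src, dst))
        return self


def rule_interface_match(wf):
    """Illustrative rule: each edge must carry at least one type the target accepts."""
    return [
        f"interface mismatch: {src} -> {dst}"
        for src, dst in wf.edges
        if not (wf.blocks[src].outputs & wf.blocks[dst].inputs)
    ]


def rule_no_cycles(wf):
    """Illustrative rule: the graph must be acyclic (Kahn's algorithm)."""
    indegree = {name: 0 for name in wf.blocks}
    for _, dst in wf.edges:
        indegree[dst] += 1
    ready = [name for name, deg in indegree.items() if deg == 0]
    visited = 0
    while ready:
        node = ready.pop()
        visited += 1
        for src, dst in wf.edges:
            if src == node:
                indegree[dst] -= 1
                if indegree[dst] == 0:
                    ready.append(dst)
    # Any block never reaching indegree 0 sits on, or behind, a cycle.
    return [] if visited == len(wf.blocks) else ["dependency cycle detected"]


# Rules are plain functions, so the verifier is just a loop over them.
RULES = [rule_interface_match, rule_no_cycles]


def verify(wf):
    """Run every rule before any agent executes and collect the violations."""
    return [violation for rule in RULES for violation in rule(wf)]


# Toy workflow: the reviewer expects a "report" but the executor emits a "result".
wf = (
    Workflow()
    .add(Block("planner", outputs=frozenset({"plan"})))
    .add(Block("executor", inputs=frozenset({"plan"}), outputs=frozenset({"result"})))
    .add(Block("reviewer", inputs=frozenset({"report"})))
    .connect("planner", "executor")
    .connect("executor", "reviewer")
)
print(verify(wf))
```

Keeping each rule as an independent function over the whole graph is one plausible design; it means a rule can be added or removed without touching the others.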

If this is right

The verifier can be packaged as a software prototype that operates before any agents are executed.
Detection remains effective after common structural edits that preserve the underlying workflow logic.
Community repositories of verified building blocks could be assembled into safe workflows using the same rules.
The approach mirrors design-time verification practices already used for conceptual models in modeling and simulation.
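
The robustness point in the list above can be illustrated on a toy graph. The sketch below assumes a hypothetical path-based rule (an external action must not be reachable from an untrusted source without passing through a guard) and a hypothetical `split_task` rewrite that divides one task into two sequential ones. Neither is taken from the paper; the example only shows why a rule defined over paths, rather than over individual nodes, would give the same verdict after such a split.

```python
def reachable_without_guard(edges, kinds, start, target_kind):
    """True if a node of `target_kind` is reachable from `start`
    along a path that never enters a node of kind 'guard'."""
    stack, seen = [start], {start}
    while stack:
        node = stack.pop()
        if kinds[node] == target_kind:
            return True
        for src, dst in edges:
            if src == node and dst not in seen and kinds[dst] != "guard":
                seen.add(dst)
                stack.append(dst)
    return False


def split_task(edges, kinds, node, first, second):
    """Hypothetical logic-preserving rewrite: replace `node` with two
    sequential nodes of the same kind (first -> second)."""
    new_kinds = {name: kind for name, kind in kinds.items() if name != node}
    new_kinds[first] = kinds[node]
    new_kinds[second] = kinds[node]
    new_edges = [(first, second)]
    for src, dst in edges:
        # Incoming edges now enter `first`; outgoing edges now leave `second`.
        new_edges.append((second if src == node else src,
                          first if dst == node else dst))
    return new_edges, new_kinds


# Flawed design: untrusted input reaches an external action with no guard.
kinds = {"user_input": "source", "agent": "task", "send_email": "action"}
edges = [("user_input", "agent"), ("agent", "send_email")]

# The same flaw, hidden by splitting the agent's task across two agents.
split_edges, split_kinds = split_task(edges, kinds, "agent", "agent_a", "agent_b")

before = reachable_without_guard(edges, kinds, "user_input", "action")
after = reachable_without_guard(split_edges, split_kinds, "user_input", "action")
print(before, after)
```

Both checks report the flaw, because splitting a task lengthens the path but does not add a guard to it.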

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rule set might be extended to new classes of building blocks without rewriting the entire verifier.
Integration with existing workflow editors could surface violations while a designer is still dragging blocks together.
If the rules prove incomplete, the evaluation method using transformed variants offers a way to test any proposed additions.

Load-bearing premise

The twelve structural rules capture every relevant compatibility issue that can arise when composing agentic workflows from the chosen building blocks.

What would settle it

A collection of agentic workflows in which a compatibility violation exists that is not caught by any of the twelve rules.

Figures

Figures reproduced from arXiv: 2606.21565 by Alexander C. Nwala, Noe Y. Flandre, Philippe J. Giabbanelli.

**Figure 2.** Reference workflows for the three case studies. Solid blue arrows show control or plan flow; dashed ...

**Figure 3.** Starting from three complementary agentic AI use cases, we formalize workflow semantics and ...

**Figure 4.** Screenshot of our software. A demonstration is available at ...

read the original abstract

Agentic AI systems orchestrate multiple LLM-based agents through workflow architectures that coordinate decisions, tools, and external actions. While current platforms emphasize runtime safeguards, little support exists for verifying workflows during system design. From a Modeling & Simulation perspective, this gap is analogous to composing conceptual models without verifying whether their building blocks interact coherently. We propose a design-time verification approach that models agentic workflows as compositions of reusable building blocks and checks their compatibility through twelve structural rules. We implemented these rules in a software prototype and evaluated them using two openly released datasets: 48 workflows with known design flaws and 168 variants that preserve workflow logic but alter graph structure. Results show that our verifier reliably detects violations even when flawed designs are obscured through structural transformations such as splitting tasks between agents. Future work could combine our verification with community repositories of building blocks to compose safe agentic workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a concrete set of twelve rules for catching structural problems when composing agentic workflows from building blocks, plus a test on transformed variants, but the evaluation looks dependent on how the 'known flaws' were chosen.

The core contribution is a design-time checker that treats agentic workflows as compositions of reusable blocks and applies twelve fixed structural rules to flag incompatibilities. They built a prototype and ran it on two open datasets: 48 workflows already labeled as flawed and 168 logic-preserving structural variants. The claim is that detection still works after rewrites like splitting tasks across agents.

What stands out is the focus on the transformation test. Most runtime tools do not address this, so showing the rules survive graph changes adds a useful dimension. The datasets are released, which helps reproducibility.

The soft spot is the evaluation setup. The abstract says the workflows have 'known design flaws' but gives no account of how those flaws were identified or whether the labeling was independent of the twelve rules. If the base cases were constructed or tagged precisely because they break the rules, then 'reliable detection' follows by design and does not test whether the rules catch real problems that would surface at runtime. The variant results inherit the same base. There is also no derivation for why these twelve rules and no comparison against other rule sets or expert review.

This is aimed at researchers and engineers building agentic systems who need early checks before deployment. A reader already working on workflow verification or AI safety tooling would get the most from it.

The paper deserves a serious referee to examine the rule justification and the independence of the test cases. The idea is straightforward enough that review would clarify whether the empirical support holds.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes a design-time verification approach for agentic AI workflows by modeling them as compositions of reusable building blocks checked for compatibility via twelve structural rules. A prototype implements the rules and is evaluated on two open datasets consisting of 48 workflows with known design flaws and 168 logic-preserving structural variants; the central claim is that the verifier reliably detects violations even after transformations such as splitting tasks between agents.

Significance. If the rule set is shown to be independently justified and the evaluation free of circularity, the work could provide a practical contribution to early detection of compatibility issues in agentic systems, extending ideas from conceptual model verification in modeling and simulation. The release of open datasets and explicit testing against graph rewrites are strengths that support reproducibility and robustness claims.

major comments (3)

[Abstract and Evaluation section] The claim that the verifier 'reliably detects violations' rests on 48 workflows labeled with 'known design flaws.' The manuscript provides no description of how these flaws were identified or whether identification was performed independently of the twelve proposed rules. If the base cases were constructed or labeled precisely by violation of those rules, successful detection is expected by construction and supplies no independent evidence that the rules catch real compatibility problems.
[Section describing the twelve structural rules (likely §3)] No derivation, formal justification, completeness argument, or comparison to alternative rule sets is supplied for the twelve rules. The central claim that the rules detect 'all relevant compatibility issues' when composing workflows therefore rests on an unevaluated assumption rather than demonstrated soundness.
[Evaluation section] No error analysis, false-positive/negative discussion, or external ground truth (e.g., runtime execution logs or independent expert labeling) is reported. The 168-variant test demonstrates robustness to structural rewrites but inherits any circularity present in the 48 base cases.
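
The error analysis requested in the comments above could take a form like the following sketch. It assumes independent labels exist for each workflow, for example from expert review or runtime logs, which the manuscript does not report. The function names and the toy values are hypothetical and are not results from the paper.

```python
def confusion_counts(flagged, truly_flawed):
    """Compare verifier verdicts with independent labels, workflow by workflow."""
    pairs = list(zip(flagged, truly_flawed))
    return {
        "tp": sum(f and t for f, t in pairs),          # flagged and flawed
        "fp": sum(f and not t for f, t in pairs),      # flagged but sound
        "fn": sum(not f and t for f, t in pairs),      # missed flaw
        "tn": sum(not f and not t for f, t in pairs),  # correctly passed
    }


def precision_recall(flagged, truly_flawed):
    """Return (precision, recall); a value is None when its denominator is zero."""
    c = confusion_counts(flagged, truly_flawed)
    precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else None
    recall = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else None
    return precision, recall


# Toy values for illustration only.
flagged = [True, True, False, False, True]        # verifier output
truly_flawed = [True, False, True, False, True]   # independent labels (assumed)
print(confusion_counts(flagged, truly_flawed))
print(precision_recall(flagged, truly_flawed))
```

If the labels were instead derived from the rules themselves, precision and recall would both be perfect by construction, which is the circularity the first major comment describes.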

minor comments (1)

[Abstract] The abstract refers to future combination with 'community repositories of building blocks' without citing existing repositories or related work on workflow component libraries.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the independence of the evaluation and the justification of the rule set. We address each major comment below and indicate planned revisions to improve clarity and transparency.

Referee: [Abstract and Evaluation section] The claim that the verifier 'reliably detects violations' rests on 48 workflows labeled with 'known design flaws.' The manuscript provides no description of how these flaws were identified or whether identification was performed independently of the twelve proposed rules. If the base cases were constructed or labeled precisely by violation of those rules, successful detection is expected by construction and supplies no independent evidence that the rules catch real compatibility problems.

Authors: The manuscript does not describe the flaw identification process for the 48 workflows. The base cases were assembled from representative agentic workflow patterns, with flaws corresponding to violations of the structural rules. We acknowledge that this creates dependence on the rule set and limits claims of fully independent validation. In the revised manuscript we will add an explicit description of dataset construction in the Evaluation section, including the source patterns used and the manual labeling criteria, and will qualify the strength of the evidence accordingly. revision: yes

Referee: [Section describing the twelve structural rules (likely §3)] No derivation, formal justification, completeness argument, or comparison to alternative rule sets is supplied for the twelve rules. The central claim that the rules detect 'all relevant compatibility issues' when composing workflows therefore rests on an unevaluated assumption rather than demonstrated soundness.

Authors: The manuscript supplies no derivation, formal justification, or completeness argument for the twelve rules. The rules were obtained by enumerating recurring compatibility failures observed when composing agentic building blocks (interface mismatches, dependency cycles, resource conflicts, etc.). We do not claim the set is complete or exhaustive. In revision we will insert a dedicated subsection in §3 that states the origin of each rule, provides a short rationale drawn from conceptual modeling principles, and explicitly notes the absence of a completeness proof or comparison against alternative rule sets. revision: yes

Referee: [Evaluation section] No error analysis, false-positive/negative discussion, or external ground truth (e.g., runtime execution logs or independent expert labeling) is reported. The 168-variant test demonstrates robustness to structural rewrites but inherits any circularity present in the 48 base cases.

Authors: The current Evaluation section contains no error analysis, false-positive/negative rates, or external ground truth. The 168-variant experiment tests invariance of the rules under logic-preserving graph rewrites and is therefore independent of the base-case labeling, yet the overall detection claim still rests on the 48 workflows. We will expand the section with a brief discussion of possible false positives (overly strict rules on valid but unconventional compositions) and will add a limitations paragraph noting the lack of runtime logs or independent expert labels. These additions will be marked as directions for future validation. revision: partial

Circularity Check

0 steps flagged

No circularity; evaluation uses external datasets

The paper proposes twelve structural rules for design-time verification of agentic workflows, implements them in a prototype, and evaluates detection on two openly released datasets (48 workflows with known design flaws plus 168 logic-preserving variants). No equations, fitted parameters, self-citations, or uniqueness theorems appear in the derivation. The central claim of reliable detection does not reduce to quantities or labels defined by the authors' own prior work or by construction from the rules themselves; the datasets are presented as external. This is the most common honest non-finding for a rule-based checker.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the unproven assumption that the chosen twelve rules comprehensively capture compatibility; no free parameters or new physical entities are introduced.

axioms (1)

domain assumption: Twelve structural rules suffice to detect incompatibilities in compositions of agentic workflow building blocks.
The verification approach and its reported reliability depend on this completeness claim.

invented entities (1)

Reusable building blocks for agentic workflows (no independent evidence)
purpose: To enable modular composition and rule-based compatibility checking
Modeling choice introduced to support the verification framework; no independent evidence of completeness is supplied.

pith-pipeline@v0.9.1-grok · 5691 in / 1332 out tokens · 16231 ms · 2026-06-26T14:08:37.550722+00:00 · methodology

