RIFT-Bench: Dynamic Red-teaming For Agentic AI Systems

Amit Giloni; Itay Gershon; Lidor Erez; Oren Rachmil; Roman Vainshtein; Roy Betser; Sindhu Padakandla; Yarin Yerushalmi Levi

arxiv: 2606.23927 · v1 · pith:XUB4HHYHnew · submitted 2026-06-22 · 💻 cs.AI

RIFT-Bench: Dynamic Red-teaming For Agentic AI Systems

Yarin Yerushalmi Levi , Roy Betser , Amit Giloni , Lidor Erez , Itay Gershon , Oren Rachmil , Sindhu Padakandla , Roman Vainshtein This is my paper

Pith reviewed 2026-06-26 08:00 UTC · model grok-4.3

classification 💻 cs.AI

keywords agentic AI systemsred-teamingsecurity evaluationhierarchical graphadversarial attacksLLM agentsdynamic evaluationvulnerability assessment

0 comments

The pith

RIFT-Bench evaluates any agentic AI system by first extracting its structure as a hierarchical graph then launching adaptive adversarial attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RIFT-Bench as a unified method for security evaluation of agentic AI systems that avoids being locked to one implementation or domain. It works through two automated phases that first build a hierarchical representation of the system's structure and then use that structure to run adaptable attacks across many vectors. This matters because agentic systems introduce decision-making attack surfaces that earlier LLM-focused tests could not compare directly. The authors show the pipeline runs on 45 systems with varied designs and can also measure the effect of mitigations. If the approach holds, developers gain a repeatable way to check security without rebuilding test tools for each new architecture.

Core claim

RIFT-Bench is a graph representation-driven methodology for dynamic red-teaming of agentic AI systems. It operates in two automated phases: Discovery extracts the internal structure via a novel hierarchical representation, and Scanning deploys adaptive adversarial attacks to generate a comprehensive evaluation report. The method evaluates the system directly across diverse attack vectors and objectives, and testing on 45 heterogeneous agentic architectures shows it generalizes effectively. It also supports direct assessment of mitigation strategies.

What carries the argument

The hierarchical representation extracted in the Discovery phase, which models the agentic system's structure so the Scanning phase can adapt its adversarial probes to that structure.

If this is right

Mitigation strategies can be inserted and measured within the same automated pipeline.
Security comparisons become possible between agentic systems that use completely different code and designs.
Evaluation reports cover multiple attack objectives without requiring new probe sets for each system.
The pipeline can serve as a base layer for repeated assessments as agentic systems are updated.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-phase structure could be applied to non-agentic LLM applications that still contain tool-use loops.
Running the benchmark on many additional systems might surface recurring structural patterns that attackers exploit.
Developers could integrate the Discovery phase into CI pipelines to flag architecture changes that affect attack surface.
Extending the probe set with domain-specific objectives would let the method address industry-specific risks without changing the core machinery.

Load-bearing premise

The hierarchical graph built in the first phase must capture enough of the agentic system's actual internal workings for the adaptive attacks in the second phase to produce a complete security picture.

What would settle it

An agentic system on which RIFT-Bench reports no critical issues yet independent manual red-teaming finds a working exploit that the automated pipeline never surfaced.

Figures

Figures reproduced from arXiv: 2606.23927 by Amit Giloni, Itay Gershon, Lidor Erez, Oren Rachmil, Roman Vainshtein, Roy Betser, Sindhu Padakandla, Yarin Yerushalmi Levi.

**Figure 1.** Figure 1: RIFT-Bench pipeline. The framework first discovers a structured NodeSpec representation from the target [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Our proposed attack taxonomy: a hierarchical [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Scanning result analysis (ASR) by surface/architecture (a), objective/framework (b), and evaluator [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Utility and ASR under different defenses. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 7.** Figure 7: Example NodeSpec planner LLM node. Agent: executor_agent id: assistant_system_executor_agent node_type: Agent description: Executor agent that follows the generated plan and uses available tools. is_graph: true agent_type: {Executor} nodes: [executor_llm, search_tool] internal_edges: START -> executor_llm executor_llm -> search_tool search_tool -> executor_llm executor_llm -> END inputs: - plan: string out… view at source ↗

**Figure 8.** Figure 8: Example NodeSpec executor-agent node. instantiation: capability flags determine whether a probe is applicable, code references instruct where modifications should be inserted, and topology fields determine how components are related. This makes NodeSpec both descriptive and operational, supporting vulnerability analysis, attack adaptation, and controlled system modification across heterogeneous agentic im… view at source ↗

**Figure 9.** Figure 9: Example NodeSpec executor LLM node. Tool: search_tool id: assistant_system_executor_agent_search_tool node_type: Tool description: Search tool used by the executor agent to retrieve external information. inputs: - query: string outputs: - search_results: list tool_example_pairs: - input: {query: "weather in Paris"} output: {search_results: [...]} code_references: - definition, tools.py:12 - usage, executor… view at source ↗

**Figure 10.** Figure 10: Example NodeSpec tool node. scanning results are derived from executions that may use emulated tools, it is important to evaluate whether the emulators preserve sufficient fidelity to support valid conclusions. E.1 Emulation Goals We evaluate tool emulation along three dimensions. Semantic fidelity measures whether the emulated output is consistent with the tool documentation and invocation arguments. Beh… view at source ↗

**Figure 11.** Figure 11: Prompt fragment injected into the attacked tool description. [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗

**Figure 12.** Figure 12: Node alignment coverage across Structure Identifier stages. [PITH_FULL_IMAGE:figures/full_fig_p031_12.png] view at source ↗

**Figure 13.** Figure 13: Code-reference overlap coefficient across Structure Identifier stages. [PITH_FULL_IMAGE:figures/full_fig_p031_13.png] view at source ↗

**Figure 14.** Figure 14: Required-key F1 across Structure Identifier stages. [PITH_FULL_IMAGE:figures/full_fig_p032_14.png] view at source ↗

**Figure 15.** Figure 15: Attack activation rate (AAR) versus deterministic attack success rate (ASR) by attack suite. Abbrevia [PITH_FULL_IMAGE:figures/full_fig_p037_15.png] view at source ↗

**Figure 16.** Figure 16: Utility versus deterministic ASR for the nine [PITH_FULL_IMAGE:figures/full_fig_p038_16.png] view at source ↗

**Figure 17.** Figure 17: Utility decrees under attack for the nine [PITH_FULL_IMAGE:figures/full_fig_p038_17.png] view at source ↗

**Figure 18.** Figure 18: Relation between execution drift and utility in the [PITH_FULL_IMAGE:figures/full_fig_p039_18.png] view at source ↗

read the original abstract

Agentic AI systems powered by large language models (LLMs) are rapidly evolving into autonomous decision-making systems, exposing attack vectors beyond those of traditional LLM vulnerabilities. Existing security evaluations are often tied to specific implementations or domains, limiting unified comparison across heterogeneous systems. To address this gap, we introduce RIFT-Bench, a graph representation-driven methodology for dynamic red-teaming that enables unified evaluations across diverse agentic architectures. Building on a novel hierarchical representation, RIFT-Bench operates in two automated phases: Discovery, which extracts system structure, and Scanning, which deploys adaptive adversarial attacks and produces a comprehensive evaluation report. It evaluates the examined system itself, leveraging a broad set of dynamically adaptable adversarial probes across diverse attack vectors and objectives. We demonstrate the effectiveness of the proposed evaluation pipeline across 45 agentic systems spanning a diverse range of implementations, showing that the approach generalizes effectively to heterogeneous agentic architectures. Beyond systems and attacks, RIFT-Bench also supports direct evaluation of mitigation strategies. These key capabilities make RIFT-Bench a scalable foundation for security evaluation of agentic AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RIFT-Bench sketches a two-phase graph pipeline for red-teaming agentic systems across 45 examples, but the abstract gives no evidence the discovery graphs are accurate enough for the scanning attacks to be comprehensive.

read the letter

RIFT-Bench tries to fill a gap by giving a single evaluation method for many different agentic AI setups instead of one-off tests tied to specific code. It uses a hierarchical graph built in a discovery phase, then runs adaptive attacks in a scanning phase, and also checks mitigations. The abstract says this was run on 45 systems of varying designs and claims the approach generalizes.

The combination of automated structure extraction plus dynamic probing is the main new element here. Treating the agent as a graph rather than a black box is a reasonable direction for handling the range of tool use, planning, and multi-step behavior in these systems. The fact that it supports mitigation testing is practical.

The soft spot is the discovery phase itself. The stress-test note is on target: if the system is closed or dynamically assembled, the graph can only come from observable calls and outputs. Nothing in the abstract shows that this produces a graph with the right control or data edges, or that missing edges do not leave major attack paths untested. Without any description of the extraction algorithm, any accuracy metric, or even sample graphs, the claim that the method works across heterogeneous architectures rests on an unexamined assumption.

No equations or fitted numbers appear, so there is nothing to check for circularity or fitting artifacts. The citation pattern cannot be assessed from the abstract.

This is for people already working on security evaluation of autonomous agents. A reader in that niche might want the full paper to see the actual implementation and numbers. It is worth sending to peer review so referees can check whether the graphs and attacks deliver what the abstract promises.

Referee Report

3 major / 2 minor

Summary. The paper introduces RIFT-Bench, a graph representation-driven methodology for dynamic red-teaming of agentic AI systems. It consists of two automated phases—Discovery, which extracts a hierarchical system structure, and Scanning, which deploys adaptive adversarial attacks—claiming to enable unified security evaluations across heterogeneous agentic architectures. The approach is demonstrated on 45 systems spanning diverse implementations and is said to generalize effectively while also supporting evaluation of mitigation strategies.

Significance. If the central claims hold, RIFT-Bench would offer a scalable, implementation-agnostic framework for evaluating security of autonomous agentic systems, filling a gap left by domain- or implementation-specific prior evaluations. This would be a meaningful contribution given the rapid deployment of such systems.

major comments (3)

[Abstract and §3] Abstract and §3 (Discovery phase): the claim that the extracted hierarchical graph representation 'accurately captures the internal structure' for subsequent adaptive probes is load-bearing for the generalization result across 45 systems, yet the manuscript provides no algorithm, pseudocode, or formal definition of the graph construction process; without this, it is impossible to assess whether observable I/O and API traces suffice for black-box or dynamically composed agents.
[Abstract and evaluation section] Abstract and evaluation section: the generalization claim rests on evaluation of 45 systems, but no quantitative results, attack success rates, coverage metrics, or comparison to baselines are reported beyond the system count; this leaves the effectiveness assertion unsupported by data.
[§4] §4 (Scanning phase): the description of 'dynamically adaptable adversarial probes across diverse attack vectors' is presented at an abstract level with no concrete probe definitions, adaptation rules, or threat model, making it impossible to verify that the probes target omitted control-flow or data-flow edges when the Discovery graph is incomplete.

minor comments (2)

Notation for the hierarchical representation is introduced without a diagram or running example, reducing clarity for readers attempting to replicate the pipeline.
[Abstract] The abstract states that RIFT-Bench 'supports direct evaluation of mitigation strategies,' but no example or interface for doing so is described.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on RIFT-Bench. The comments correctly identify areas where additional detail is needed to support the claims. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract and §3] Abstract and §3 (Discovery phase): the claim that the extracted hierarchical graph representation 'accurately captures the internal structure' for subsequent adaptive probes is load-bearing for the generalization result across 45 systems, yet the manuscript provides no algorithm, pseudocode, or formal definition of the graph construction process; without this, it is impossible to assess whether observable I/O and API traces suffice for black-box or dynamically composed agents.

Authors: We agree that the absence of a formal definition and pseudocode for the graph construction process in the Discovery phase limits verifiability. In the revised manuscript we will add an explicit algorithm description, pseudocode, and elaboration on how observable I/O and API traces are used to build the hierarchical representation. This will also clarify the assumptions and limitations for black-box and dynamically composed agents. revision: yes
Referee: [Abstract and evaluation section] Abstract and evaluation section: the generalization claim rests on evaluation of 45 systems, but no quantitative results, attack success rates, coverage metrics, or comparison to baselines are reported beyond the system count; this leaves the effectiveness assertion unsupported by data.

Authors: The manuscript demonstrates the pipeline on 45 systems but does not include the requested quantitative metrics or baseline comparisons. We will add a results subsection with attack success rates, coverage metrics, and baseline comparisons to substantiate the generalization and effectiveness claims. revision: yes
Referee: [§4] §4 (Scanning phase): the description of 'dynamically adaptable adversarial probes across diverse attack vectors' is presented at an abstract level with no concrete probe definitions, adaptation rules, or threat model, making it impossible to verify that the probes target omitted control-flow or data-flow edges when the Discovery graph is incomplete.

Authors: We will expand §4 with concrete probe definitions, adaptation rules, a threat model, and discussion of how probes address potentially missing edges in incomplete Discovery graphs. revision: yes

Circularity Check

0 steps flagged

No circularity: independent methodology construction

full rationale

The paper presents RIFT-Bench as a new graph-representation methodology with Discovery (structure extraction) and Scanning (adaptive attacks) phases. No equations, fitted parameters, predictions that reduce to inputs, or load-bearing self-citations appear in the abstract or described claims. The generalization result across 45 systems is presented as empirical demonstration rather than a definitional re-expression. The central claims rest on the independent construction of the hierarchical representation and probe deployment, not on any tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all technical details of graph construction and attack adaptation are omitted.

pith-pipeline@v0.9.1-grok · 5753 in / 1073 out tokens · 28251 ms · 2026-06-26T08:00:36.619111+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 8 canonical work pages · 2 internal anchors

[1]

ACM Computing Surveys, 57(1):1–37

Mitre att&ck: State of the art and way forward. ACM Computing Surveys, 57(1):1–37. Soufiane Amini, Yassine Benajiba, Cesare Bernardis, Paul Cayet, Hassan Chafi, Abderrahim Fathan, Louis Faucon, Damien Hilloulin, Sungpack Hong, Ingo Kossyk, and 1 others. 2025. Open agent specification (agent spec): A unified representation for ai agents. arXiv preprint arX...

work page arXiv 2025
[2]

Derczynski et al.Garak: A Framework for Security Probing Large Language Models

Ai agents under threat: A survey of key secu- rity challenges and future pathways.ACM Comput- ing Surveys, 57(7):1–36. Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho Majumdar, and Nanna Inie. 2024. garak: A frame- work for security probing large language models. arXiv preprint arXiv:2406.11036. Jianshuo Dong, Sheng Guo, Hao Wang, Xun Chen, Zhuotao...

work page arXiv 2024
[3]

Mohamed Amine Ferrag, Norbert Tihanyi, Djallel Hamouda, Leandros Maglaras, Abderrahmane Lakas, and Merouane Debbah

Wasp: Benchmarking web agent security against prompt injection attacks.Advances in Neural Information Processing Systems, 38. Mohamed Amine Ferrag, Norbert Tihanyi, Djallel Hamouda, Leandros Maglaras, Abderrahmane Lakas, and Merouane Debbah. 2025. From prompt injec- tions to protocol exploits: Threats in llm-powered ai agents workflows.ICT Express. Matija...

work page arXiv 2025
[4]

Shows how Langfuse is integrated for deep traceability of agent execution; accessed 2026

Amazon bedrock agentcore observability with langfuse. Shows how Langfuse is integrated for deep traceability of agent execution; accessed 2026. Omer Hofman, Jonathan Brokman, Oren Rachmil, Shamik Bose, Vikas Pahuja, Toshiya Shimizu, Tr- isha Starostina, Kelly Marchisio, Seraphina Goldfarb- Tarrant, and Roman Vainshtein. 2025. Maps: A mul- tilingual benchm...

work page arXiv 2026
[5]

DeepSeek-V3 Technical Report

Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437. Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Au- mayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, and 1 others. 2025. Tool- sandbox: A stateful, conversational, interactive eval- uation benchmark for llm tool use capabilities. In Findings of the Association for Comput...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Zhenzhen Ren, Xinpeng Zhang, Zhenxing Qian, Yan Gao, Yu Shi, Shuxin Zheng, and Jiyan He

Training-free policy violation detection via activation-space whitening in llms.arXiv preprint arXiv:2512.03994. Zhenzhen Ren, Xinpeng Zhang, Zhenxing Qian, Yan Gao, Yu Shi, Shuxin Zheng, and Jiyan He. 2025. Gtm: Simulating the world of tools for ai agents. arXiv preprint arXiv:2512.04535. Yangjun Ruan, Honghua Dong, Andrew Wang, Sil- viu Pitis, Yongchao ...

work page arXiv 2025
[7]

InFindings of the Association for Computational Linguistics: ACL 2025, pages 4998–5036

Megaagent: A large-scale autonomous llm- based multi-agent system without predefined sops. InFindings of the Association for Computational Linguistics: ACL 2025, pages 4998–5036. Webster. 2024. Promptfoo: Test and red-team llm appli- cations. Open-source framework for LLM evaluation and red teaming. xAI. 2025. Grok 4 Fast. Shunyu Yao, Jeffrey Zhao, Dian Y...

work page arXiv 2025
[8]

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

A survey on trustworthy llm agents: Threats and countermeasures. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 6216–6226. Matei Zaharia, Andrew Chen, Aaron Davidson, Ali Gh- odsi, Sue Ann Hong, Andy Konwinski, Siddharth Murching, Tomas Nykodym, Paul Ogilvie, Mani Parkhe, and 1 others. 2018. Acceleratin...

work page internal anchor Pith review Pith/arXiv arXiv 2018
[9]

Node Creation Gate (applyfirstforevery candidate): - Create a node onlyifitispart of a meaningful runtime component of the agentic systeminthisfile. - Do NOT create nodesforhelper-only code suchasCLI/arg parsing,file/pathinput normalization,print/log-only wrappers, thin delegation wrappers that mostly validate inputsandcall one downstream function, schema...
[10]

After passing the gate, extract concrete variables/instances that are real runtime components (agents, tools, databases, MCP servers, LLM clients, routers, orchestrators )
[11]

other", other_description=

Unified config/prompt instance rule: - When config/prompt/instruction structures define distinct runtime behaviorsor roles, emit distinct component nodes per runtime instance/role evenifthey share the same Python implementationobject. - If a config/prompt artifactisruntime- meaningful butnotyet bound to a concrete componentinthisfile, emit a standalone no...
[12]

Do NOT emit one node per switch option

Switch-case variant rule: When the same runtime componentisinstantiated inside a switch-style conditional (for example Python match/case,or if/elifthat selects one config option), emit exactly ONE NodeSpecforthat component. Do NOT emit one node per switch option. Record option- specific differencesinmetadata (e.g., selected model/env keys/config values),n...
[13]

For each extracted component,setname to be ascloseaspossible to the actual variable/ instance name usedincode
[14]

Fill these fieldsforNEW nodes: name, node_type, description, code_references, inputs, outputs
[15]

Code references should includeanyrelevant evidenceforthe node (definition, initialization, implementation, usage,input /output schema, prompts/configs,orother related references)
[16]

Each distinct code piece must be a separate CodeReferenceobject(donotmerge multiple snippets into one CodeReference)
[17]

Put unresolved evidence into metadatainone consistent structure: - metadata.open_question: one short sentence forthe main unresolved point (ornullif none). - metadata.missing_evidence:listof objects with: - field: unresolved runtime field/binding/ dependency name, - reason: why itisunresolvedfromthis filealone, - evidence_code_refs_hint: short line/file h...
[18]

If descriptionisavailableincode, use it; otherwise write a concise placeholder
[19]

A System must includeor coordinate oneormore Agent components ( directlyorthrough nested runtime structure )

Use node_type=System onlyfortrue system- level orchestration/container components of the agentic system. A System must includeor coordinate oneormore Agent components ( directlyorthrough nested runtime structure ). Workflow runtimes that orchestrate multiple runtime nodes (forexample node- graph/state-machine style workflows) should be treatedasSystem. Do...
[20]

other", other_description=

If unsure of node_type, use NodeType(type=" other", other_description="...")
[21]

Output Rules:

If no new nodes are found,returnNEW_NODES = []. Output Rules:
[22]

Output valid Python code only (no markdownor prose)
[23]

The output must construct NodeSpec objects that conform to NodeSpec_schema.py
[24]

Include necessary imports (NodeSpec, NodeType, CodeReference, InputPort, OutputPort)
[25]

Child discovery: You are discovering direct children of a runtime parent componentinan agentic system graph

Assign the finallistto a top-level name NEW_NODES (a Pythonlist). Child discovery: You are discovering direct children of a runtime parent componentinan agentic system graph . Goal: Return only NEW child proposalsforthis parent inthisround(delta,notfulllist). Terminology (mandatory): - Agentic Component: a runtime unitwithits own operational roleinthe age...
[26]

Decide direct-child relations using Agentic Component boundaries,notcode-object proximity
[27]

Exception:ifthe artifactisthe concrete instantiated binding of a distinct Agentic Component used by the parent, it must be emittedasthat component

Support artifacts arenotchildren by default: keep proposals at Agentic-Component level only. Exception:ifthe artifactisthe concrete instantiated binding of a distinct Agentic Component used by the parent, it must be emittedasthat component. Example: when the parentisan Agent, an llm config isevidence of an`LLM`child of that Agent; when the parentisan LLM,...
[28]

Agent uses LLM

A direct child means`candidate`isdirectly attached under`parent_node`asits own agentic componentinthe agentic system ( containment/attachment boundary),notmerely referencedorindirectly used; examples: a System can have Agent children, an Agent can have Tool/Server/LLM children,anda Server can have Tool children, but "Agent uses LLM " doesnotmake the Agent...
[29]

Code structureisevidence only,whilechild decisions are about agentic componentsin the agentic system: wrappers/helpers/config/ prompt/schema artifacts may provide evidence forattachment, but they arenotchild nodes by themselves
[30]

Not direct when relation exists only through another distinct Agentic Component boundary (parent -> component_X -> candidate)
[31]

Donot returnduplicates already presentin discovered_children_so_far
[32]

8)`guidance_summary`provides high-level direction only; use it to guide search focus , but never treat itasdirect evidenceand never let it replace concrete code evidence

Every childinchildren_add must include at least one attachment code referencein code_references_add_child. 8)`guidance_summary`provides high-level direction only; use it to guide search focus , but never treat itasdirect evidenceand never let it replace concrete code evidence
[33]

Use`retrieved_evidence_context`to propose children before askingformore retrieval
[34]

Searchforevidence until decisions are evidence-backed; never guess orfabricate missing facts

If evidenceisincomplete,returnpartial progress now: includeallnewly found childreninchildren_addandinclude a context_requestforthe remaining unknowns ( donotwaitfora fulllistbefore responding). Searchforevidence until decisions are evidence-backed; never guess orfabricate missing facts. RAG will retrieve context
[35]

If existing evidence mentions a symbol/ header/template/import/prompt that may define attached agentic components, request information about that symbol before setting is_complete=true
[36]

Use`retrieval_history`to avoid repeating the same query unless you are explicitly refining it to target a different missing detail
[37]

Output rules:

Set is_complete=true only when no additional plausible direct children remain. Output rules:
[38]

Return exactly these top-level keys: children_add, is_complete, completion_reason , optional context_request
[39]

If is_complete=false, context_requestis requiredandmust include at least one needs item
[40]

If is_complete=true, omit context_request
[41]

Donot returnextra top-level keys
[42]

Name should match the concrete runtime instance name usedin code when available (nota genericclass/ typelabel)

For each added child, include exactly these fields: name, node_type, description,and code_references_add_child. Name should match the concrete runtime instance name usedin code when available (nota genericclass/ typelabel). Description should be plain textandshould follow evidenceincode when available
[43]

- Use kind="usage"for allentries

Attachment evidence requirements are strict: - code_references_add_child must contain only attachment/wiring evidence where the childisattached/used by this parent. - Use kind="usage"for allentries. 33 - Do NOT include full child definition/ implementation referencesinthis phase; those are added later
[44]

children_add

When returning context_request, each needs[]. query must target exactly one missing fact using concrete code literals (symbol names, importlines, assignments, call expressions, orexactfilename strings) rather than natural-language requests;ifunresolved, issue a refined literal queryforthe same fact instead of broadening scope. Output JSON: { "children_add...
[45]

STARTandEND are required virtual anchors andmust appear onlyinedges
[46]

Allowed endpoints are: START, END,and child_vars
[47]

Never invent endpoints outside allowed names
[48]

5)`guidance_summary`provides high-level direction only; use it to guide search focus , but never treat itasedge evidenceand never let it replace concrete code evidence

Prefer a single START entry edgeanda single END exit edge per parent, unless concrete evidence clearly shows multiple entry/exit points. 5)`guidance_summary`provides high-level direction only; use it to guide search focus , but never treat itasedge evidenceand never let it replace concrete code evidence
[49]

Use`child_nodes`semantics (role/type/ description/evidence) to decide plausible flow, but ground actual edge decisionsin concrete evidencefromparent/code context
[50]

8)`edges_add`and`edges_remove`must be idempotentandnon-duplicative (same from_/ to shouldnotbe repeated)

Prefer minimal, high-confidence mutations: - keep evidence-backed`current_edges`, - remove edges only when contradicted by stronger evidence, - add only edges you can justify. 8)`edges_add`and`edges_remove`must be idempotentandnon-duplicative (same from_/ to shouldnotbe repeated). 9)`reason`foreach mutation must be shortand explicit about intent (entry ro...
[51]

Searchforevidence until decisions are evidence-backed; never guessorfabricate missing facts

If evidenceisinsufficientfor anyrequired edge decision, donotguess; use context_requestinthe standardformat. Searchforevidence until decisions are evidence-backed; never guessorfabricate missing facts. RAG will retrieve context
[52]

When returning context_request, each needs[]. query must target exactly one missing fact using concrete code literals (symbol names, importlines, assignments, call expressions, orexactfilename strings) rather than natural-language requests;ifunresolved, issue a refined literal queryforthe same fact instead of broadening scope
[53]

edges_add

Return exactly these top-level keys:` edges_add`,`edges_remove`,andoptional` context_request`. Output JSON: { "edges_add": [ {"from_": "START|<child_var>", "to": "< child_var>|END", "reason": "<short reason>"} ], "edges_remove": [ {"from_": "START|<child_var>", "to": "< child_var>|END", "reason": "<short reason>"} ], "context_request": { "request_id": "co...

2024

[1] [1]

ACM Computing Surveys, 57(1):1–37

Mitre att&ck: State of the art and way forward. ACM Computing Surveys, 57(1):1–37. Soufiane Amini, Yassine Benajiba, Cesare Bernardis, Paul Cayet, Hassan Chafi, Abderrahim Fathan, Louis Faucon, Damien Hilloulin, Sungpack Hong, Ingo Kossyk, and 1 others. 2025. Open agent specification (agent spec): A unified representation for ai agents. arXiv preprint arX...

work page arXiv 2025

[2] [2]

Derczynski et al.Garak: A Framework for Security Probing Large Language Models

Ai agents under threat: A survey of key secu- rity challenges and future pathways.ACM Comput- ing Surveys, 57(7):1–36. Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho Majumdar, and Nanna Inie. 2024. garak: A frame- work for security probing large language models. arXiv preprint arXiv:2406.11036. Jianshuo Dong, Sheng Guo, Hao Wang, Xun Chen, Zhuotao...

work page arXiv 2024

[3] [3]

Mohamed Amine Ferrag, Norbert Tihanyi, Djallel Hamouda, Leandros Maglaras, Abderrahmane Lakas, and Merouane Debbah

Wasp: Benchmarking web agent security against prompt injection attacks.Advances in Neural Information Processing Systems, 38. Mohamed Amine Ferrag, Norbert Tihanyi, Djallel Hamouda, Leandros Maglaras, Abderrahmane Lakas, and Merouane Debbah. 2025. From prompt injec- tions to protocol exploits: Threats in llm-powered ai agents workflows.ICT Express. Matija...

work page arXiv 2025

[4] [4]

Shows how Langfuse is integrated for deep traceability of agent execution; accessed 2026

Amazon bedrock agentcore observability with langfuse. Shows how Langfuse is integrated for deep traceability of agent execution; accessed 2026. Omer Hofman, Jonathan Brokman, Oren Rachmil, Shamik Bose, Vikas Pahuja, Toshiya Shimizu, Tr- isha Starostina, Kelly Marchisio, Seraphina Goldfarb- Tarrant, and Roman Vainshtein. 2025. Maps: A mul- tilingual benchm...

work page arXiv 2026

[5] [5]

DeepSeek-V3 Technical Report

Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437. Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Au- mayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, and 1 others. 2025. Tool- sandbox: A stateful, conversational, interactive eval- uation benchmark for llm tool use capabilities. In Findings of the Association for Comput...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Zhenzhen Ren, Xinpeng Zhang, Zhenxing Qian, Yan Gao, Yu Shi, Shuxin Zheng, and Jiyan He

Training-free policy violation detection via activation-space whitening in llms.arXiv preprint arXiv:2512.03994. Zhenzhen Ren, Xinpeng Zhang, Zhenxing Qian, Yan Gao, Yu Shi, Shuxin Zheng, and Jiyan He. 2025. Gtm: Simulating the world of tools for ai agents. arXiv preprint arXiv:2512.04535. Yangjun Ruan, Honghua Dong, Andrew Wang, Sil- viu Pitis, Yongchao ...

work page arXiv 2025

[7] [7]

InFindings of the Association for Computational Linguistics: ACL 2025, pages 4998–5036

Megaagent: A large-scale autonomous llm- based multi-agent system without predefined sops. InFindings of the Association for Computational Linguistics: ACL 2025, pages 4998–5036. Webster. 2024. Promptfoo: Test and red-team llm appli- cations. Open-source framework for LLM evaluation and red teaming. xAI. 2025. Grok 4 Fast. Shunyu Yao, Jeffrey Zhao, Dian Y...

work page arXiv 2025

[8] [8]

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

A survey on trustworthy llm agents: Threats and countermeasures. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 6216–6226. Matei Zaharia, Andrew Chen, Aaron Davidson, Ali Gh- odsi, Sue Ann Hong, Andy Konwinski, Siddharth Murching, Tomas Nykodym, Paul Ogilvie, Mani Parkhe, and 1 others. 2018. Acceleratin...

work page internal anchor Pith review Pith/arXiv arXiv 2018

[9] [9]

Node Creation Gate (applyfirstforevery candidate): - Create a node onlyifitispart of a meaningful runtime component of the agentic systeminthisfile. - Do NOT create nodesforhelper-only code suchasCLI/arg parsing,file/pathinput normalization,print/log-only wrappers, thin delegation wrappers that mostly validate inputsandcall one downstream function, schema...

[10] [10]

After passing the gate, extract concrete variables/instances that are real runtime components (agents, tools, databases, MCP servers, LLM clients, routers, orchestrators )

[11] [11]

other", other_description=

Unified config/prompt instance rule: - When config/prompt/instruction structures define distinct runtime behaviorsor roles, emit distinct component nodes per runtime instance/role evenifthey share the same Python implementationobject. - If a config/prompt artifactisruntime- meaningful butnotyet bound to a concrete componentinthisfile, emit a standalone no...

[12] [12]

Do NOT emit one node per switch option

Switch-case variant rule: When the same runtime componentisinstantiated inside a switch-style conditional (for example Python match/case,or if/elifthat selects one config option), emit exactly ONE NodeSpecforthat component. Do NOT emit one node per switch option. Record option- specific differencesinmetadata (e.g., selected model/env keys/config values),n...

[13] [13]

For each extracted component,setname to be ascloseaspossible to the actual variable/ instance name usedincode

[14] [14]

Fill these fieldsforNEW nodes: name, node_type, description, code_references, inputs, outputs

[15] [15]

Code references should includeanyrelevant evidenceforthe node (definition, initialization, implementation, usage,input /output schema, prompts/configs,orother related references)

[16] [16]

Each distinct code piece must be a separate CodeReferenceobject(donotmerge multiple snippets into one CodeReference)

[17] [17]

Put unresolved evidence into metadatainone consistent structure: - metadata.open_question: one short sentence forthe main unresolved point (ornullif none). - metadata.missing_evidence:listof objects with: - field: unresolved runtime field/binding/ dependency name, - reason: why itisunresolvedfromthis filealone, - evidence_code_refs_hint: short line/file h...

[18] [18]

If descriptionisavailableincode, use it; otherwise write a concise placeholder

[19] [19]

A System must includeor coordinate oneormore Agent components ( directlyorthrough nested runtime structure )

Use node_type=System onlyfortrue system- level orchestration/container components of the agentic system. A System must includeor coordinate oneormore Agent components ( directlyorthrough nested runtime structure ). Workflow runtimes that orchestrate multiple runtime nodes (forexample node- graph/state-machine style workflows) should be treatedasSystem. Do...

[20] [20]

other", other_description=

If unsure of node_type, use NodeType(type=" other", other_description="...")

[21] [21]

Output Rules:

If no new nodes are found,returnNEW_NODES = []. Output Rules:

[22] [22]

Output valid Python code only (no markdownor prose)

[23] [23]

The output must construct NodeSpec objects that conform to NodeSpec_schema.py

[24] [24]

Include necessary imports (NodeSpec, NodeType, CodeReference, InputPort, OutputPort)

[25] [25]

Child discovery: You are discovering direct children of a runtime parent componentinan agentic system graph

Assign the finallistto a top-level name NEW_NODES (a Pythonlist). Child discovery: You are discovering direct children of a runtime parent componentinan agentic system graph . Goal: Return only NEW child proposalsforthis parent inthisround(delta,notfulllist). Terminology (mandatory): - Agentic Component: a runtime unitwithits own operational roleinthe age...

[26] [26]

Decide direct-child relations using Agentic Component boundaries,notcode-object proximity

[27] [27]

Exception:ifthe artifactisthe concrete instantiated binding of a distinct Agentic Component used by the parent, it must be emittedasthat component

Support artifacts arenotchildren by default: keep proposals at Agentic-Component level only. Exception:ifthe artifactisthe concrete instantiated binding of a distinct Agentic Component used by the parent, it must be emittedasthat component. Example: when the parentisan Agent, an llm config isevidence of an`LLM`child of that Agent; when the parentisan LLM,...

[28] [28]

Agent uses LLM

A direct child means`candidate`isdirectly attached under`parent_node`asits own agentic componentinthe agentic system ( containment/attachment boundary),notmerely referencedorindirectly used; examples: a System can have Agent children, an Agent can have Tool/Server/LLM children,anda Server can have Tool children, but "Agent uses LLM " doesnotmake the Agent...

[29] [29]

Code structureisevidence only,whilechild decisions are about agentic componentsin the agentic system: wrappers/helpers/config/ prompt/schema artifacts may provide evidence forattachment, but they arenotchild nodes by themselves

[30] [30]

Not direct when relation exists only through another distinct Agentic Component boundary (parent -> component_X -> candidate)

[31] [31]

Donot returnduplicates already presentin discovered_children_so_far

[32] [32]

8)`guidance_summary`provides high-level direction only; use it to guide search focus , but never treat itasdirect evidenceand never let it replace concrete code evidence

Every childinchildren_add must include at least one attachment code referencein code_references_add_child. 8)`guidance_summary`provides high-level direction only; use it to guide search focus , but never treat itasdirect evidenceand never let it replace concrete code evidence

[33] [33]

Use`retrieved_evidence_context`to propose children before askingformore retrieval

[34] [34]

Searchforevidence until decisions are evidence-backed; never guess orfabricate missing facts

If evidenceisincomplete,returnpartial progress now: includeallnewly found childreninchildren_addandinclude a context_requestforthe remaining unknowns ( donotwaitfora fulllistbefore responding). Searchforevidence until decisions are evidence-backed; never guess orfabricate missing facts. RAG will retrieve context

[35] [35]

If existing evidence mentions a symbol/ header/template/import/prompt that may define attached agentic components, request information about that symbol before setting is_complete=true

[36] [36]

Use`retrieval_history`to avoid repeating the same query unless you are explicitly refining it to target a different missing detail

[37] [37]

Output rules:

Set is_complete=true only when no additional plausible direct children remain. Output rules:

[38] [38]

Return exactly these top-level keys: children_add, is_complete, completion_reason , optional context_request

[39] [39]

If is_complete=false, context_requestis requiredandmust include at least one needs item

[40] [40]

If is_complete=true, omit context_request

[41] [41]

Donot returnextra top-level keys

[42] [42]

Name should match the concrete runtime instance name usedin code when available (nota genericclass/ typelabel)

For each added child, include exactly these fields: name, node_type, description,and code_references_add_child. Name should match the concrete runtime instance name usedin code when available (nota genericclass/ typelabel). Description should be plain textandshould follow evidenceincode when available

[43] [43]

- Use kind="usage"for allentries

Attachment evidence requirements are strict: - code_references_add_child must contain only attachment/wiring evidence where the childisattached/used by this parent. - Use kind="usage"for allentries. 33 - Do NOT include full child definition/ implementation referencesinthis phase; those are added later

[44] [44]

children_add

When returning context_request, each needs[]. query must target exactly one missing fact using concrete code literals (symbol names, importlines, assignments, call expressions, orexactfilename strings) rather than natural-language requests;ifunresolved, issue a refined literal queryforthe same fact instead of broadening scope. Output JSON: { "children_add...

[45] [45]

STARTandEND are required virtual anchors andmust appear onlyinedges

[46] [46]

Allowed endpoints are: START, END,and child_vars

[47] [47]

Never invent endpoints outside allowed names

[48] [48]

5)`guidance_summary`provides high-level direction only; use it to guide search focus , but never treat itasedge evidenceand never let it replace concrete code evidence

Prefer a single START entry edgeanda single END exit edge per parent, unless concrete evidence clearly shows multiple entry/exit points. 5)`guidance_summary`provides high-level direction only; use it to guide search focus , but never treat itasedge evidenceand never let it replace concrete code evidence

[49] [49]

Use`child_nodes`semantics (role/type/ description/evidence) to decide plausible flow, but ground actual edge decisionsin concrete evidencefromparent/code context

[50] [50]

8)`edges_add`and`edges_remove`must be idempotentandnon-duplicative (same from_/ to shouldnotbe repeated)

Prefer minimal, high-confidence mutations: - keep evidence-backed`current_edges`, - remove edges only when contradicted by stronger evidence, - add only edges you can justify. 8)`edges_add`and`edges_remove`must be idempotentandnon-duplicative (same from_/ to shouldnotbe repeated). 9)`reason`foreach mutation must be shortand explicit about intent (entry ro...

[51] [51]

Searchforevidence until decisions are evidence-backed; never guessorfabricate missing facts

If evidenceisinsufficientfor anyrequired edge decision, donotguess; use context_requestinthe standardformat. Searchforevidence until decisions are evidence-backed; never guessorfabricate missing facts. RAG will retrieve context

[52] [52]

When returning context_request, each needs[]. query must target exactly one missing fact using concrete code literals (symbol names, importlines, assignments, call expressions, orexactfilename strings) rather than natural-language requests;ifunresolved, issue a refined literal queryforthe same fact instead of broadening scope

[53] [53]

edges_add

Return exactly these top-level keys:` edges_add`,`edges_remove`,andoptional` context_request`. Output JSON: { "edges_add": [ {"from_": "START|<child_var>", "to": "< child_var>|END", "reason": "<short reason>"} ], "edges_remove": [ {"from_": "START|<child_var>", "to": "< child_var>|END", "reason": "<short reason>"} ], "context_request": { "request_id": "co...

2024