pith. sign in

arxiv: 2606.23927 · v1 · pith:XUB4HHYHnew · submitted 2026-06-22 · 💻 cs.AI

RIFT-Bench: Dynamic Red-teaming For Agentic AI Systems

Pith reviewed 2026-06-26 08:00 UTC · model grok-4.3

classification 💻 cs.AI
keywords agentic AI systemsred-teamingsecurity evaluationhierarchical graphadversarial attacksLLM agentsdynamic evaluationvulnerability assessment
0
0 comments X

The pith

RIFT-Bench evaluates any agentic AI system by first extracting its structure as a hierarchical graph then launching adaptive adversarial attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RIFT-Bench as a unified method for security evaluation of agentic AI systems that avoids being locked to one implementation or domain. It works through two automated phases that first build a hierarchical representation of the system's structure and then use that structure to run adaptable attacks across many vectors. This matters because agentic systems introduce decision-making attack surfaces that earlier LLM-focused tests could not compare directly. The authors show the pipeline runs on 45 systems with varied designs and can also measure the effect of mitigations. If the approach holds, developers gain a repeatable way to check security without rebuilding test tools for each new architecture.

Core claim

RIFT-Bench is a graph representation-driven methodology for dynamic red-teaming of agentic AI systems. It operates in two automated phases: Discovery extracts the internal structure via a novel hierarchical representation, and Scanning deploys adaptive adversarial attacks to generate a comprehensive evaluation report. The method evaluates the system directly across diverse attack vectors and objectives, and testing on 45 heterogeneous agentic architectures shows it generalizes effectively. It also supports direct assessment of mitigation strategies.

What carries the argument

The hierarchical representation extracted in the Discovery phase, which models the agentic system's structure so the Scanning phase can adapt its adversarial probes to that structure.

If this is right

  • Mitigation strategies can be inserted and measured within the same automated pipeline.
  • Security comparisons become possible between agentic systems that use completely different code and designs.
  • Evaluation reports cover multiple attack objectives without requiring new probe sets for each system.
  • The pipeline can serve as a base layer for repeated assessments as agentic systems are updated.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same two-phase structure could be applied to non-agentic LLM applications that still contain tool-use loops.
  • Running the benchmark on many additional systems might surface recurring structural patterns that attackers exploit.
  • Developers could integrate the Discovery phase into CI pipelines to flag architecture changes that affect attack surface.
  • Extending the probe set with domain-specific objectives would let the method address industry-specific risks without changing the core machinery.

Load-bearing premise

The hierarchical graph built in the first phase must capture enough of the agentic system's actual internal workings for the adaptive attacks in the second phase to produce a complete security picture.

What would settle it

An agentic system on which RIFT-Bench reports no critical issues yet independent manual red-teaming finds a working exploit that the automated pipeline never surfaced.

Figures

Figures reproduced from arXiv: 2606.23927 by Amit Giloni, Itay Gershon, Lidor Erez, Oren Rachmil, Roman Vainshtein, Roy Betser, Sindhu Padakandla, Yarin Yerushalmi Levi.

Figure 1
Figure 1. Figure 1: RIFT-Bench pipeline. The framework first discovers a structured NodeSpec representation from the target [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Our proposed attack taxonomy: a hierarchical [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Scanning result analysis (ASR) by surface/architecture (a), objective/framework (b), and evaluator [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Utility and ASR under different defenses. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 7
Figure 7. Figure 7: Example NodeSpec planner LLM node. Agent: executor_agent id: assistant_system_executor_agent node_type: Agent description: Executor agent that follows the generated plan and uses available tools. is_graph: true agent_type: {Executor} nodes: [executor_llm, search_tool] internal_edges: START -> executor_llm executor_llm -> search_tool search_tool -> executor_llm executor_llm -> END inputs: - plan: string out… view at source ↗
Figure 8
Figure 8. Figure 8: Example NodeSpec executor-agent node. instantiation: capability flags determine whether a probe is applicable, code references instruct where modifications should be inserted, and topology fields determine how components are related. This makes NodeSpec both descriptive and operational, supporting vulnerability analysis, attack adaptation, and controlled system modification across hetero￾geneous agentic im… view at source ↗
Figure 9
Figure 9. Figure 9: Example NodeSpec executor LLM node. Tool: search_tool id: assistant_system_executor_agent_search_tool node_type: Tool description: Search tool used by the executor agent to retrieve external information. inputs: - query: string outputs: - search_results: list tool_example_pairs: - input: {query: "weather in Paris"} output: {search_results: [...]} code_references: - definition, tools.py:12 - usage, executor… view at source ↗
Figure 10
Figure 10. Figure 10: Example NodeSpec tool node. scanning results are derived from executions that may use emulated tools, it is important to evaluate whether the emulators preserve sufficient fidelity to support valid conclusions. E.1 Emulation Goals We evaluate tool emulation along three dimensions. Semantic fidelity measures whether the emulated output is consistent with the tool documentation and invocation arguments. Beh… view at source ↗
Figure 11
Figure 11. Figure 11: Prompt fragment injected into the attacked tool description. [PITH_FULL_IMAGE:figures/full_fig_p024_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Node alignment coverage across Structure Identifier stages. [PITH_FULL_IMAGE:figures/full_fig_p031_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Code-reference overlap coefficient across Structure Identifier stages. [PITH_FULL_IMAGE:figures/full_fig_p031_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Required-key F1 across Structure Identifier stages. [PITH_FULL_IMAGE:figures/full_fig_p032_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Attack activation rate (AAR) versus deterministic attack success rate (ASR) by attack suite. Abbrevia [PITH_FULL_IMAGE:figures/full_fig_p037_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Utility versus deterministic ASR for the nine [PITH_FULL_IMAGE:figures/full_fig_p038_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Utility decrees under attack for the nine [PITH_FULL_IMAGE:figures/full_fig_p038_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Relation between execution drift and utility in the [PITH_FULL_IMAGE:figures/full_fig_p039_18.png] view at source ↗
read the original abstract

Agentic AI systems powered by large language models (LLMs) are rapidly evolving into autonomous decision-making systems, exposing attack vectors beyond those of traditional LLM vulnerabilities. Existing security evaluations are often tied to specific implementations or domains, limiting unified comparison across heterogeneous systems. To address this gap, we introduce RIFT-Bench, a graph representation-driven methodology for dynamic red-teaming that enables unified evaluations across diverse agentic architectures. Building on a novel hierarchical representation, RIFT-Bench operates in two automated phases: Discovery, which extracts system structure, and Scanning, which deploys adaptive adversarial attacks and produces a comprehensive evaluation report. It evaluates the examined system itself, leveraging a broad set of dynamically adaptable adversarial probes across diverse attack vectors and objectives. We demonstrate the effectiveness of the proposed evaluation pipeline across 45 agentic systems spanning a diverse range of implementations, showing that the approach generalizes effectively to heterogeneous agentic architectures. Beyond systems and attacks, RIFT-Bench also supports direct evaluation of mitigation strategies. These key capabilities make RIFT-Bench a scalable foundation for security evaluation of agentic AI systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces RIFT-Bench, a graph representation-driven methodology for dynamic red-teaming of agentic AI systems. It consists of two automated phases—Discovery, which extracts a hierarchical system structure, and Scanning, which deploys adaptive adversarial attacks—claiming to enable unified security evaluations across heterogeneous agentic architectures. The approach is demonstrated on 45 systems spanning diverse implementations and is said to generalize effectively while also supporting evaluation of mitigation strategies.

Significance. If the central claims hold, RIFT-Bench would offer a scalable, implementation-agnostic framework for evaluating security of autonomous agentic systems, filling a gap left by domain- or implementation-specific prior evaluations. This would be a meaningful contribution given the rapid deployment of such systems.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (Discovery phase): the claim that the extracted hierarchical graph representation 'accurately captures the internal structure' for subsequent adaptive probes is load-bearing for the generalization result across 45 systems, yet the manuscript provides no algorithm, pseudocode, or formal definition of the graph construction process; without this, it is impossible to assess whether observable I/O and API traces suffice for black-box or dynamically composed agents.
  2. [Abstract and evaluation section] Abstract and evaluation section: the generalization claim rests on evaluation of 45 systems, but no quantitative results, attack success rates, coverage metrics, or comparison to baselines are reported beyond the system count; this leaves the effectiveness assertion unsupported by data.
  3. [§4] §4 (Scanning phase): the description of 'dynamically adaptable adversarial probes across diverse attack vectors' is presented at an abstract level with no concrete probe definitions, adaptation rules, or threat model, making it impossible to verify that the probes target omitted control-flow or data-flow edges when the Discovery graph is incomplete.
minor comments (2)
  1. Notation for the hierarchical representation is introduced without a diagram or running example, reducing clarity for readers attempting to replicate the pipeline.
  2. [Abstract] The abstract states that RIFT-Bench 'supports direct evaluation of mitigation strategies,' but no example or interface for doing so is described.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on RIFT-Bench. The comments correctly identify areas where additional detail is needed to support the claims. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Discovery phase): the claim that the extracted hierarchical graph representation 'accurately captures the internal structure' for subsequent adaptive probes is load-bearing for the generalization result across 45 systems, yet the manuscript provides no algorithm, pseudocode, or formal definition of the graph construction process; without this, it is impossible to assess whether observable I/O and API traces suffice for black-box or dynamically composed agents.

    Authors: We agree that the absence of a formal definition and pseudocode for the graph construction process in the Discovery phase limits verifiability. In the revised manuscript we will add an explicit algorithm description, pseudocode, and elaboration on how observable I/O and API traces are used to build the hierarchical representation. This will also clarify the assumptions and limitations for black-box and dynamically composed agents. revision: yes

  2. Referee: [Abstract and evaluation section] Abstract and evaluation section: the generalization claim rests on evaluation of 45 systems, but no quantitative results, attack success rates, coverage metrics, or comparison to baselines are reported beyond the system count; this leaves the effectiveness assertion unsupported by data.

    Authors: The manuscript demonstrates the pipeline on 45 systems but does not include the requested quantitative metrics or baseline comparisons. We will add a results subsection with attack success rates, coverage metrics, and baseline comparisons to substantiate the generalization and effectiveness claims. revision: yes

  3. Referee: [§4] §4 (Scanning phase): the description of 'dynamically adaptable adversarial probes across diverse attack vectors' is presented at an abstract level with no concrete probe definitions, adaptation rules, or threat model, making it impossible to verify that the probes target omitted control-flow or data-flow edges when the Discovery graph is incomplete.

    Authors: We will expand §4 with concrete probe definitions, adaptation rules, a threat model, and discussion of how probes address potentially missing edges in incomplete Discovery graphs. revision: yes

Circularity Check

0 steps flagged

No circularity: independent methodology construction

full rationale

The paper presents RIFT-Bench as a new graph-representation methodology with Discovery (structure extraction) and Scanning (adaptive attacks) phases. No equations, fitted parameters, predictions that reduce to inputs, or load-bearing self-citations appear in the abstract or described claims. The generalization result across 45 systems is presented as empirical demonstration rather than a definitional re-expression. The central claims rest on the independent construction of the hierarchical representation and probe deployment, not on any tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; all technical details of graph construction and attack adaptation are omitted.

pith-pipeline@v0.9.1-grok · 5753 in / 1073 out tokens · 28251 ms · 2026-06-26T08:00:36.619111+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

53 extracted references · 8 canonical work pages · 2 internal anchors

  1. [1]

    ACM Computing Surveys, 57(1):1–37

    Mitre att&ck: State of the art and way forward. ACM Computing Surveys, 57(1):1–37. Soufiane Amini, Yassine Benajiba, Cesare Bernardis, Paul Cayet, Hassan Chafi, Abderrahim Fathan, Louis Faucon, Damien Hilloulin, Sungpack Hong, Ingo Kossyk, and 1 others. 2025. Open agent specification (agent spec): A unified representation for ai agents. arXiv preprint arX...

  2. [2]

    Derczynski et al.Garak: A Framework for Security Probing Large Language Models

    Ai agents under threat: A survey of key secu- rity challenges and future pathways.ACM Comput- ing Surveys, 57(7):1–36. Leon Derczynski, Erick Galinkin, Jeffrey Martin, Subho Majumdar, and Nanna Inie. 2024. garak: A frame- work for security probing large language models. arXiv preprint arXiv:2406.11036. Jianshuo Dong, Sheng Guo, Hao Wang, Xun Chen, Zhuotao...

  3. [3]

    Mohamed Amine Ferrag, Norbert Tihanyi, Djallel Hamouda, Leandros Maglaras, Abderrahmane Lakas, and Merouane Debbah

    Wasp: Benchmarking web agent security against prompt injection attacks.Advances in Neural Information Processing Systems, 38. Mohamed Amine Ferrag, Norbert Tihanyi, Djallel Hamouda, Leandros Maglaras, Abderrahmane Lakas, and Merouane Debbah. 2025. From prompt injec- tions to protocol exploits: Threats in llm-powered ai agents workflows.ICT Express. Matija...

  4. [4]

    Shows how Langfuse is integrated for deep traceability of agent execution; accessed 2026

    Amazon bedrock agentcore observability with langfuse. Shows how Langfuse is integrated for deep traceability of agent execution; accessed 2026. Omer Hofman, Jonathan Brokman, Oren Rachmil, Shamik Bose, Vikas Pahuja, Toshiya Shimizu, Tr- isha Starostina, Kelly Marchisio, Seraphina Goldfarb- Tarrant, and Roman Vainshtein. 2025. Maps: A mul- tilingual benchm...

  5. [5]

    DeepSeek-V3 Technical Report

    Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437. Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Au- mayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, and 1 others. 2025. Tool- sandbox: A stateful, conversational, interactive eval- uation benchmark for llm tool use capabilities. In Findings of the Association for Comput...

  6. [6]

    Zhenzhen Ren, Xinpeng Zhang, Zhenxing Qian, Yan Gao, Yu Shi, Shuxin Zheng, and Jiyan He

    Training-free policy violation detection via activation-space whitening in llms.arXiv preprint arXiv:2512.03994. Zhenzhen Ren, Xinpeng Zhang, Zhenxing Qian, Yan Gao, Yu Shi, Shuxin Zheng, and Jiyan He. 2025. Gtm: Simulating the world of tools for ai agents. arXiv preprint arXiv:2512.04535. Yangjun Ruan, Honghua Dong, Andrew Wang, Sil- viu Pitis, Yongchao ...

  7. [7]

    InFindings of the Association for Computational Linguistics: ACL 2025, pages 4998–5036

    Megaagent: A large-scale autonomous llm- based multi-agent system without predefined sops. InFindings of the Association for Computational Linguistics: ACL 2025, pages 4998–5036. Webster. 2024. Promptfoo: Test and red-team llm appli- cations. Open-source framework for LLM evaluation and red teaming. xAI. 2025. Grok 4 Fast. Shunyu Yao, Jeffrey Zhao, Dian Y...

  8. [8]

    Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

    A survey on trustworthy llm agents: Threats and countermeasures. InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V . 2, pages 6216–6226. Matei Zaharia, Andrew Chen, Aaron Davidson, Ali Gh- odsi, Sue Ann Hong, Andy Konwinski, Siddharth Murching, Tomas Nykodym, Paul Ogilvie, Mani Parkhe, and 1 others. 2018. Acceleratin...

  9. [9]

    Node Creation Gate (applyfirstforevery candidate): - Create a node onlyifitispart of a meaningful runtime component of the agentic systeminthisfile. - Do NOT create nodesforhelper-only code suchasCLI/arg parsing,file/pathinput normalization,print/log-only wrappers, thin delegation wrappers that mostly validate inputsandcall one downstream function, schema...

  10. [10]

    After passing the gate, extract concrete variables/instances that are real runtime components (agents, tools, databases, MCP servers, LLM clients, routers, orchestrators )

  11. [11]

    other", other_description=

    Unified config/prompt instance rule: - When config/prompt/instruction structures define distinct runtime behaviorsor roles, emit distinct component nodes per runtime instance/role evenifthey share the same Python implementationobject. - If a config/prompt artifactisruntime- meaningful butnotyet bound to a concrete componentinthisfile, emit a standalone no...

  12. [12]

    Do NOT emit one node per switch option

    Switch-case variant rule: When the same runtime componentisinstantiated inside a switch-style conditional (for example Python match/case,or if/elifthat selects one config option), emit exactly ONE NodeSpecforthat component. Do NOT emit one node per switch option. Record option- specific differencesinmetadata (e.g., selected model/env keys/config values),n...

  13. [13]

    For each extracted component,setname to be ascloseaspossible to the actual variable/ instance name usedincode

  14. [14]

    Fill these fieldsforNEW nodes: name, node_type, description, code_references, inputs, outputs

  15. [15]

    Code references should includeanyrelevant evidenceforthe node (definition, initialization, implementation, usage,input /output schema, prompts/configs,orother related references)

  16. [16]

    Each distinct code piece must be a separate CodeReferenceobject(donotmerge multiple snippets into one CodeReference)

  17. [17]

    Put unresolved evidence into metadatainone consistent structure: - metadata.open_question: one short sentence forthe main unresolved point (ornullif none). - metadata.missing_evidence:listof objects with: - field: unresolved runtime field/binding/ dependency name, - reason: why itisunresolvedfromthis filealone, - evidence_code_refs_hint: short line/file h...

  18. [18]

    If descriptionisavailableincode, use it; otherwise write a concise placeholder

  19. [19]

    A System must includeor coordinate oneormore Agent components ( directlyorthrough nested runtime structure )

    Use node_type=System onlyfortrue system- level orchestration/container components of the agentic system. A System must includeor coordinate oneormore Agent components ( directlyorthrough nested runtime structure ). Workflow runtimes that orchestrate multiple runtime nodes (forexample node- graph/state-machine style workflows) should be treatedasSystem. Do...

  20. [20]

    other", other_description=

    If unsure of node_type, use NodeType(type=" other", other_description="...")

  21. [21]

    Output Rules:

    If no new nodes are found,returnNEW_NODES = []. Output Rules:

  22. [22]

    Output valid Python code only (no markdownor prose)

  23. [23]

    The output must construct NodeSpec objects that conform to NodeSpec_schema.py

  24. [24]

    Include necessary imports (NodeSpec, NodeType, CodeReference, InputPort, OutputPort)

  25. [25]

    Child discovery: You are discovering direct children of a runtime parent componentinan agentic system graph

    Assign the finallistto a top-level name NEW_NODES (a Pythonlist). Child discovery: You are discovering direct children of a runtime parent componentinan agentic system graph . Goal: Return only NEW child proposalsforthis parent inthisround(delta,notfulllist). Terminology (mandatory): - Agentic Component: a runtime unitwithits own operational roleinthe age...

  26. [26]

    Decide direct-child relations using Agentic Component boundaries,notcode-object proximity

  27. [27]

    Exception:ifthe artifactisthe concrete instantiated binding of a distinct Agentic Component used by the parent, it must be emittedasthat component

    Support artifacts arenotchildren by default: keep proposals at Agentic-Component level only. Exception:ifthe artifactisthe concrete instantiated binding of a distinct Agentic Component used by the parent, it must be emittedasthat component. Example: when the parentisan Agent, an llm config isevidence of an`LLM`child of that Agent; when the parentisan LLM,...

  28. [28]

    Agent uses LLM

    A direct child means`candidate`isdirectly attached under`parent_node`asits own agentic componentinthe agentic system ( containment/attachment boundary),notmerely referencedorindirectly used; examples: a System can have Agent children, an Agent can have Tool/Server/LLM children,anda Server can have Tool children, but "Agent uses LLM " doesnotmake the Agent...

  29. [29]

    Code structureisevidence only,whilechild decisions are about agentic componentsin the agentic system: wrappers/helpers/config/ prompt/schema artifacts may provide evidence forattachment, but they arenotchild nodes by themselves

  30. [30]

    Not direct when relation exists only through another distinct Agentic Component boundary (parent -> component_X -> candidate)

  31. [31]

    Donot returnduplicates already presentin discovered_children_so_far

  32. [32]

    8)`guidance_summary`provides high-level direction only; use it to guide search focus , but never treat itasdirect evidenceand never let it replace concrete code evidence

    Every childinchildren_add must include at least one attachment code referencein code_references_add_child. 8)`guidance_summary`provides high-level direction only; use it to guide search focus , but never treat itasdirect evidenceand never let it replace concrete code evidence

  33. [33]

    Use`retrieved_evidence_context`to propose children before askingformore retrieval

  34. [34]

    Searchforevidence until decisions are evidence-backed; never guess orfabricate missing facts

    If evidenceisincomplete,returnpartial progress now: includeallnewly found childreninchildren_addandinclude a context_requestforthe remaining unknowns ( donotwaitfora fulllistbefore responding). Searchforevidence until decisions are evidence-backed; never guess orfabricate missing facts. RAG will retrieve context

  35. [35]

    If existing evidence mentions a symbol/ header/template/import/prompt that may define attached agentic components, request information about that symbol before setting is_complete=true

  36. [36]

    Use`retrieval_history`to avoid repeating the same query unless you are explicitly refining it to target a different missing detail

  37. [37]

    Output rules:

    Set is_complete=true only when no additional plausible direct children remain. Output rules:

  38. [38]

    Return exactly these top-level keys: children_add, is_complete, completion_reason , optional context_request

  39. [39]

    If is_complete=false, context_requestis requiredandmust include at least one needs item

  40. [40]

    If is_complete=true, omit context_request

  41. [41]

    Donot returnextra top-level keys

  42. [42]

    Name should match the concrete runtime instance name usedin code when available (nota genericclass/ typelabel)

    For each added child, include exactly these fields: name, node_type, description,and code_references_add_child. Name should match the concrete runtime instance name usedin code when available (nota genericclass/ typelabel). Description should be plain textandshould follow evidenceincode when available

  43. [43]

    - Use kind="usage"for allentries

    Attachment evidence requirements are strict: - code_references_add_child must contain only attachment/wiring evidence where the childisattached/used by this parent. - Use kind="usage"for allentries. 33 - Do NOT include full child definition/ implementation referencesinthis phase; those are added later

  44. [44]

    children_add

    When returning context_request, each needs[]. query must target exactly one missing fact using concrete code literals (symbol names, importlines, assignments, call expressions, orexactfilename strings) rather than natural-language requests;ifunresolved, issue a refined literal queryforthe same fact instead of broadening scope. Output JSON: { "children_add...

  45. [45]

    STARTandEND are required virtual anchors andmust appear onlyinedges

  46. [46]

    Allowed endpoints are: START, END,and child_vars

  47. [47]

    Never invent endpoints outside allowed names

  48. [48]

    5)`guidance_summary`provides high-level direction only; use it to guide search focus , but never treat itasedge evidenceand never let it replace concrete code evidence

    Prefer a single START entry edgeanda single END exit edge per parent, unless concrete evidence clearly shows multiple entry/exit points. 5)`guidance_summary`provides high-level direction only; use it to guide search focus , but never treat itasedge evidenceand never let it replace concrete code evidence

  49. [49]

    Use`child_nodes`semantics (role/type/ description/evidence) to decide plausible flow, but ground actual edge decisionsin concrete evidencefromparent/code context

  50. [50]

    8)`edges_add`and`edges_remove`must be idempotentandnon-duplicative (same from_/ to shouldnotbe repeated)

    Prefer minimal, high-confidence mutations: - keep evidence-backed`current_edges`, - remove edges only when contradicted by stronger evidence, - add only edges you can justify. 8)`edges_add`and`edges_remove`must be idempotentandnon-duplicative (same from_/ to shouldnotbe repeated). 9)`reason`foreach mutation must be shortand explicit about intent (entry ro...

  51. [51]

    Searchforevidence until decisions are evidence-backed; never guessorfabricate missing facts

    If evidenceisinsufficientfor anyrequired edge decision, donotguess; use context_requestinthe standardformat. Searchforevidence until decisions are evidence-backed; never guessorfabricate missing facts. RAG will retrieve context

  52. [52]

    When returning context_request, each needs[]. query must target exactly one missing fact using concrete code literals (symbol names, importlines, assignments, call expressions, orexactfilename strings) rather than natural-language requests;ifunresolved, issue a refined literal queryforthe same fact instead of broadening scope

  53. [53]

    edges_add

    Return exactly these top-level keys:` edges_add`,`edges_remove`,andoptional` context_request`. Output JSON: { "edges_add": [ {"from_": "START|<child_var>", "to": "< child_var>|END", "reason": "<short reason>"} ], "edges_remove": [ {"from_": "START|<child_var>", "to": "< child_var>|END", "reason": "<short reason>"} ], "context_request": { "request_id": "co...