Context, Reasoning, and Hierarchy: A Cost-Performance Study of Compound LLM Agent Design in an Adversarial POMDP

Adrian Taylor; Chung-Horng Lung; Igor Bogdanov; Jie Gao; Marzia Zaman; Thomas Kunz

arXiv: 2605.16205 · v1 · pith:5KKTXIYEnew · submitted 2026-05-15 · cs.AI · cs.CL · cs.LG · cs.MA · cs.SY · eess.SY

Pith reviewed 2026-05-20 18:59 UTC · model grok-4.3

Classification: cs.AI · cs.CL · cs.LG · cs.MA · cs.SY · eess.SY

Keywords: LLM agents, compound agents, POMDP, state abstraction, hierarchical decomposition, deliberation tools, cost performance, adversarial environments

The pith

Programmatic state abstraction delivers the largest returns per token spent for LLM agents in an adversarial POMDP, improving mean return by up to 76 percent over raw observations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how choices about what an agent observes, how it reasons internally, and how work is split among components shape both reward and token costs when LLM agents operate in an adversarial, partially observable setting. It runs twelve configurations across six models and nearly thirty-five hundred episodes, tracking every token. The data show that a programmatic layer which tracks and compresses state history produces the largest efficiency gains while adding self-critique and self-improvement tools to a hierarchy tends to lower performance and raise costs. A reader would care because these patterns supply concrete guidance on where to spend engineering effort when building reliable autonomous agents under uncertainty.

Core claim

In the tested adversarial POMDP, replacing raw observations with a deterministic state-tracking layer that compresses history delivers the largest returns per token spent (RPTS) and raises mean return by as much as 76 percent. Hierarchical decomposition without any deliberation tools yields the highest absolute performance for most models. Distributing self-questioning, self-critique, and self-improvement tools across the hierarchy produces a deliberation cascade: mean return becomes up to 3.4 times worse while token consumption rises 1.8 to 2.7 times. Context engineering is therefore generally more cost-effective than deeper per-agent reasoning.
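To make the idea of a deterministic state-tracking layer concrete, here is a minimal sketch. It is not the paper's implementation: the class name `StateTracker`, the status strings, the action strings, and the fixed-length history window are all assumptions chosen for illustration. The only point it shows is that the bookkeeping happens in ordinary code, so the model receives a short rendered summary instead of the raw observation stream.

```python
from dataclasses import dataclass, field


@dataclass
class StateTracker:
    """Hypothetical tracker: plain bookkeeping, no LLM calls."""

    max_history: int = 5                      # assumed size of the compressed history
    hosts: dict[str, str] = field(default_factory=dict)
    history: list[str] = field(default_factory=list)

    def update(self, step: int, action: str, observation: dict[str, str]) -> None:
        # Keep only hosts whose reported status differs from what we already know.
        changed = {h: s for h, s in observation.items() if self.hosts.get(h) != s}
        self.hosts.update(changed)
        names = ", ".join(sorted(changed)) or "none"
        self.history.append(f"step {step}: {action}; changed: {names}")
        # Compress by dropping everything older than the last max_history steps.
        self.history = self.history[-self.max_history:]

    def render(self) -> str:
        # Text handed to the agent in place of the raw observation.
        status = ", ".join(f"{h}={s}" for h, s in sorted(self.hosts.items()))
        return "NETWORK STATUS: " + status + "\nRECENT STEPS:\n" + "\n".join(self.history)


tracker = StateTracker(max_history=2)
tracker.update(1, "Monitor", {"User1": "clean", "Enterprise1": "clean"})
tracker.update(2, "Analyse User1", {"User1": "compromised", "Enterprise1": "clean"})
tracker.update(3, "Restore User1", {"User1": "clean", "Enterprise1": "clean"})
print(tracker.render())
```

The prompt size stays bounded by `max_history` no matter how long the episode runs, which is the property that matters for cost per step.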

What carries the argument

The argument rests on a controlled comparison of three design axes: context representation (raw observations versus programmatic state abstraction), deliberation tools (self-questioning, self-critique, and self-improvement), and hierarchy (a monolithic agent versus specialized sub-agents), with full token-level cost accounting across 3,475 episodes.

If this is right

Programmatic state abstraction delivers the largest returns per token spent across the tested model families.
Hierarchical decomposition without deliberation achieves the best absolute performance for most models.
Distributing deliberation tools across a hierarchy triggers a deliberation cascade that degrades mean return while increasing token consumption.
Context engineering is generally more cost-effective than adding deliberation capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same priority on clean state infrastructure over added reasoning layers may hold in other partially observable sequential domains such as robotics or logistics.
System builders could replace some prompting complexity with lightweight programmatic trackers to reduce inference spend without loss of capability.
The observed interference implies that reasoning depth is best controlled at the overall architecture level rather than multiplied inside every sub-agent.

Load-bearing premise

The twelve configurations and the reward structure of this particular simulator are representative enough of other adversarial POMDPs that the observed ranking of context and hierarchy over deliberation will generalize.

What would settle it

Re-running the identical twelve configurations inside a different adversarial POMDP simulator that uses a materially different reward function and checking whether state abstraction still produces the highest returns per token while hierarchy-plus-deliberation still underperforms.

Figures

Figures reproduced from arXiv: 2605.16205 by Adrian Taylor, Chung-Horng Lung, Igor Bogdanov, Jie Gao, Marzia Zaman, Thomas Kunz.

**Figure 2.** Context engineering heatmap. Each cell shows …

**Figure 3.** Deliberation cascade effect. Paired bars show mean …

**Figure 4.** Cost-performance Pareto frontiers. Points shaped by axis (circles: context, triangles: deliberation, squares: hierarchy).

**Figure 5.** Catastrophic failure rate (return < −150) by model and configuration. G2.5FL fails across all configurations; context engineering reduces catastrophic rates for most other models.

… performing best, consistent with its difficulty exploiting structured context. Hierarchy often improves absolute return but is less token-efficient: hier-base consumes substantially more tokens than obs+net, yielding lower RPTS…

**Figure 6.** Best configuration per axis compared to the shared …

**Figure 7.** Marginal value of adding individual context com…

**Figure 8.** Raw observation penalty. Gap between obs-only and the structured hist+net anchor configuration per model. Longer bars indicate larger benefit from replacing raw observations with programmatic context.

**Figure 9.** Context component waterfall. Additive effect of history and network status on top of raw observation. Green = …

**Figure 10.** Context component interaction. obs+net compared …

**Figure 11.** Model fingerprints. Each radar shows normalized best performance on three axes (context, deliberation, hierarchy).

**Figure 14.** Deliberation ROI. Each point shows one model–…

**Figure 13.** Hierarchy degradation. Performance change when …

**Figure 17.** Deliberation progression. Performance trajectory …

**Figure 20.** Model ranking stability. Lines connect each …

**Figure 19.** Global head-to-head win-rate matrix. Compares …

**Figure 22.** Score distributions by design axis. Violins show …

**Figure 25.** Outcome breakdown for the +COT configura…

**Figure 26.** Token cost progression from cheapest (obs) to most expensive (hier-delib). Deliberation and hierarchy dramatically increase token consumption; the deliberation cascade represents the cost ceiling.

**Figure 27.** Token profile shift. Stacked bars show prompt …

**Figure 28.** Token generation velocity. Shows the exponential …

**Figure 29.** Cost vs. Win Rate. Plots the token cost multiplier …
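Two quantities named in the captions above, the cost-performance Pareto frontier (Figure 4) and the catastrophic failure rate with its return < −150 cutoff (Figure 5), are straightforward to compute from per-configuration summaries. The sketch below is one plausible way to do so and is not taken from the paper: the function names, the tuple layout, and the sample numbers are placeholders, and the paper may build its frontiers differently.

```python
def pareto_frontier(points):
    """Return names of non-dominated configurations.

    points: list of (name, mean_tokens, mean_return).
    Fewer tokens and a higher (less negative) return are both better.
    """
    frontier = []
    best_return = float("-inf")
    # Walk from cheapest to most expensive; on ties, try the better return first.
    for name, tokens, ret in sorted(points, key=lambda p: (p[1], -p[2])):
        # A point survives only if it beats every cheaper point on return.
        if ret > best_return:
            frontier.append(name)
            best_return = ret
    return frontier


def catastrophic_rate(returns, threshold=-150.0):
    """Fraction of episodes whose return falls strictly below the threshold."""
    return sum(r < threshold for r in returns) / len(returns)


# Placeholder numbers, not results from the paper.
configs = [
    ("cfg_a", 100, -80.0),
    ("cfg_b", 200, -60.0),
    ("cfg_c", 250, -70.0),   # dominated by cfg_b: costs more, returns less
    ("cfg_d", 400, -40.0),
    ("cfg_e", 500, -90.0),   # dominated by every cheaper configuration
]
print(pareto_frontier(configs))
print(catastrophic_rate([-20.0, -160.0, -150.0, -300.0]))
```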

Original abstract

Deploying compound LLM agents in adversarial, partially observable sequential environments requires navigating several design dimensions: (1) what the agent sees, (2) how it reasons, and (3) how tasks are decomposed across components. Yet practitioners lack guidance on which design choices improve performance versus merely increase inference costs. We present a controlled study of compound LLM agent design in CybORG CAGE-2, a cyber defense environment modeled as a Partially Observable Markov Decision Process (POMDP). Reward is non-positive, so all configurations operate in a failure-mitigation mode. Our evaluation spans five model families, six models, and twelve configurations (3,475 episodes) with token-level cost accounting. We vary context representation (raw observations vs. a deterministic state-tracking layer with compressed history), deliberation (self-questioning, self-critique, and self-improvement tools, with optional chain-of-thought prompting), and hierarchical decomposition (monolithic ReAct vs. delegation to specialized sub-agents). We find that: (1) Programmatic state abstraction delivers the largest returns per token spent (RPTS), improving mean return by up to 76% over raw observations. (2) Distributing deliberation tools across a hierarchy degrades performance relative to hierarchy alone for all five model families, reaching up to 3.4× worse mean return while using 1.8-2.7× more tokens. We call this destructive pattern a deliberation cascade. (3) Hierarchical decomposition without deliberation achieves the best absolute performance for most models, and context engineering is generally more cost-effective than deliberation. These findings suggest a design principle for structured adversarial POMDPs: invest in programmatic infrastructure and clean task decomposition rather than deeper per-agent reasoning, as these strategies can interfere when combined.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper measures clear cost-performance differences for context vs. deliberation vs. hierarchy in one cyber POMDP and finds that state abstraction wins on efficiency while extra deliberation hurts in hierarchies.

The main thing here is that they ran a controlled comparison of twelve agent configurations in CybORG CAGE-2 and got usable numbers: programmatic state abstraction delivers the largest returns per token spent and lifts mean return by up to 76% over raw observations, hierarchy without deliberation tools does best in absolute performance for most of the six models, and distributing self-questioning, self-critique, and self-improvement tools across a hierarchy creates a deliberation cascade that makes mean return up to 3.4 times worse while burning 1.8-2.7 times more tokens. They tracked everything across 3,475 episodes and five model families with explicit token accounting, which is more disciplined than most agent papers.

Referee Report

2 major / 2 minor

Summary. The paper conducts a large-scale empirical study of compound LLM agent designs in the CybORG CAGE-2 adversarial POMDP. It systematically varies context representation (raw observations versus programmatic state abstraction with compressed history), deliberation mechanisms (self-questioning, self-critique, and self-improvement with optional CoT), and task decomposition (monolithic ReAct versus hierarchical delegation to specialized sub-agents). Evaluating 12 configurations across 6 models from 5 families in 3,475 episodes with detailed token cost tracking, the authors report that programmatic state abstraction provides the highest returns per token spent (RPTS), with up to 76% improvement in mean return over raw observations. Hierarchical decomposition without deliberation yields the best absolute performance for most models, while adding deliberation tools to the hierarchy triggers a 'deliberation cascade' resulting in up to 3.4 times worse returns at 1.8-2.7 times the token cost. The study concludes with a suggested design principle favoring programmatic infrastructure and clean decomposition over deeper reasoning in such environments.

Significance. If these empirical patterns hold, the work offers actionable insights for practitioners building LLM agents in partially observable adversarial settings, emphasizing the cost-effectiveness of state abstraction and simple hierarchies. The strengths include the controlled experimental design, explicit accounting for inference costs at the token level, and the scale of evaluation covering multiple model families. This could help shift focus from complex reasoning chains to better context engineering in agent architectures. However, the single-environment nature of the study tempers the generalizability of the proposed design principle.

major comments (2)

Abstract: The claim that the findings suggest a design principle for structured adversarial POMDPs is load-bearing for the paper's broader contribution, yet rests exclusively on results from CybORG CAGE-2 under its fixed non-positive reward and observation structure. A concrete test to address the correctness risk would be replication of the 12 configurations in at least one additional adversarial POMDP with differing state space and dynamics to check whether the RPTS ranking and deliberation cascade persist.
Results section: The reported mean return gains (up to 76%) and deliberation cascade effects (up to 3.4× worse return) are presented without variance estimates, standard errors, or statistical significance tests across the 3,475 episodes. Given stochasticity from both LLM sampling and the POMDP, this omission weakens confidence in the configuration rankings and effect sizes.

minor comments (2)

Abstract: The exact definition and computation of RPTS (e.g., whether it is the ratio of mean return to mean tokens per episode or an aggregate) is not fully specified, which affects reproducibility of the cost-performance claims.
Methods: A summary table explicitly listing all 12 configurations (combinations of context type, deliberation tools, and hierarchy level) per model would improve clarity and allow readers to map the reported outcomes directly to design choices.
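The RPTS ambiguity raised in the first minor comment can be made concrete with a small sketch. The two functions below are candidate definitions only; neither is confirmed as the paper's formula, and the function names and sample numbers are placeholders. They show that a ratio of means and a mean of per-episode ratios give different values on the same data, so the choice has to be stated for the cost-performance claims to be reproducible.

```python
from statistics import mean


def rpts_ratio_of_means(returns, tokens):
    """Candidate A: mean episode return divided by mean tokens per episode."""
    return mean(returns) / mean(tokens)


def rpts_mean_of_ratios(returns, tokens):
    """Candidate B: average of each episode's own return-per-token ratio."""
    return mean(r / t for r, t in zip(returns, tokens))


# Placeholder episodes, not data from the paper: one cheap, one expensive.
episode_returns = [-10.0, -100.0]
episode_tokens = [1000, 20000]

print(rpts_ratio_of_means(episode_returns, episode_tokens))   # about -0.00524
print(rpts_mean_of_ratios(episode_returns, episode_tokens))   # -0.0075
```

A further detail any definition would need to settle: because reward in this environment is non-positive, a plain return-over-tokens ratio moves closer to zero as token use grows, so the sign convention or baseline used matters as much as the averaging order.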

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating planned revisions where appropriate.

Referee: Abstract: The claim that the findings suggest a design principle for structured adversarial POMDPs is load-bearing for the paper's broader contribution, yet rests exclusively on results from CybORG CAGE-2 under its fixed non-positive reward and observation structure. A concrete test to address the correctness risk would be replication of the 12 configurations in at least one additional adversarial POMDP with differing state space and dynamics to check whether the RPTS ranking and deliberation cascade persist.

Authors: We appreciate the referee's point regarding the scope of our conclusions. Our work presents a controlled, large-scale study focused on the CybORG CAGE-2 environment, which is a standard benchmark for adversarial cyber defense POMDPs. We agree that the single-environment design limits broad claims, and replicating the full experimental suite in a second environment would require resources beyond a minor revision. We will therefore revise the abstract and conclusion sections to qualify the suggested design principle more precisely as being supported by evidence from this class of structured adversarial POMDPs, while explicitly noting the single-environment limitation. This maintains the contribution without overstating generalizability.
Revision: partial
Referee: Results section: The reported mean return gains (up to 76%) and deliberation cascade effects (up to 3.4× worse return) are presented without variance estimates, standard errors, or statistical significance tests across the 3,475 episodes. Given stochasticity from both LLM sampling and the POMDP, this omission weakens confidence in the configuration rankings and effect sizes.

Authors: We agree that reporting variance and conducting statistical tests would strengthen confidence in the results, particularly given the stochastic nature of both the LLM outputs and the environment. In the revised manuscript, we will add standard errors (or confidence intervals) to all reported mean returns and RPTS values. We will also include appropriate statistical significance tests (such as paired t-tests or non-parametric alternatives) for the primary comparisons between configurations to support the reported effect sizes and rankings.
Revision: yes
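As an illustration of the kind of reporting promised in this response, the sketch below computes a standard error of the mean and a two-sided sign-flip permutation test on paired differences, using only the Python standard library. It is a generic example of a non-parametric paired test, not the analysis the authors will run. The function names are placeholders, and pairing episodes across two configurations (for example by a shared environment seed) is an assumption that the paper's setup may or may not support.

```python
import random
from math import sqrt
from statistics import mean, stdev


def standard_error(xs):
    """Standard error of the mean, using the sample standard deviation."""
    return stdev(xs) / sqrt(len(xs))


def paired_permutation_pvalue(a, b, n_resamples=10_000, seed=0):
    """Two-sided sign-flip permutation test on the paired differences a[i] - b[i]."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)          # fixed seed so the estimate is repeatable
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis each difference is equally likely to have either sign.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_resamples + 1)
```

With per-episode returns for two configurations in lists `returns_a` and `returns_b` (hypothetical names), `standard_error(returns_a)` gives the error bar for one mean and `paired_permutation_pvalue(returns_a, returns_b)` gives a p-value for the difference.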

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with direct measurements

The paper reports results from a controlled experimental study running 3,475 episodes across twelve agent configurations in the fixed CybORG CAGE-2 POMDP. All central claims (state abstraction giving the largest RPTS and up to a 76% gain in mean return, the deliberation cascade degrading performance, hierarchy without deliberation as best for most models) are obtained by direct measurement of return and token cost under the environment's external reward structure. No derivation chain, fitted parameters renamed as predictions, or load-bearing self-citations exist; the findings are falsifiable by re-running the same simulator and configurations. This is the most common honest non-finding for empirical papers.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central empirical claims rest on the assumption that the chosen simulator and reward function capture the relevant trade-offs; no new mathematical axioms or invented entities are introduced.

axioms (1)

Domain assumption: CybORG CAGE-2 constitutes a representative adversarial POMDP for evaluating LLM agent design choices.
The paper treats performance rankings observed inside this simulator as informative for the broader class of structured adversarial POMDPs.

pith-pipeline@v0.9.0 · 5898 in / 1313 out tokens · 32619 ms · 2026-05-20T18:59:29.895496+00:00 · methodology

Agent prompt excerpts

Source: Bogdanov et al., ACM CAIS ’26, May 26–29, 2026, San Jose, CA, USA.

REVIEW SITUATION: Check network status and step history. Which hosts need attention? What actions have been tried?
IDENTIFY TARGET: Select the most critical host or threat to address this step
GATHER INFO: If needed, use get_analysis_of_host_update for detailed analysis of a changed host
GET SUGGESTIONS: Call get_suggestion_for_next_action with JSON: {"target_host": "hostname", "situation": "description", "severity": "level", "context": "relevant history"}
DECIDE: Choose ONE action from suggestions. You may override based on strategic reasoning
rules:
- You must select ONLY ONE action for your final Answer from the list of suggestions provided by the 'get_suggestion_for_next_action' tool
- Your final Answer MUST be a verbatim copy of the action-string from ONE of the suggestions
- TOOLS CANNOT HANDLE MULTIPLE...

GET CURRENT STATE: Use get_host_current_state for the target host
GET BASELINE: Use get_host_baseline_state to compare against initial state
IDENTIFY ANOMALIES: What changed? New processes, connections, missing services?
ASSESS SEVERITY: How critical is this compromise? Is there C2 activity?
RECOMMEND ACTION: Should we contain, investigate further, or just monitor?
tools:
- name: "get_host_current_state"
  description: "Get the current state details for a specific host. The input must be a single hostname."
  example_calling: "get_host_current_state: Enterprise1"
- name: "get_host_baseline_state"
  description: "Get the baseline state details for a...

READ SITUATION: Check SITUATION_JSON for target_host, threat description, severity, and context
EVALUATE ACTIONS: Consider available action types and their costs vs benefits
RANK THREE: Provide three suggestions with confidence scores (0.0-1.0), highest confidence first
answer_format: |
  Your response MUST STRICTLY be a JSON array of objects, where each object represents a suggested action. Each object must have ONLY the following keys: "action", "confidence".
The ActionChooser has no tools, it is a pure generation agent that r...
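The RANK THREE answer format and the verbatim-copy rule quoted in these prompt excerpts lend themselves to a programmatic check. The sketch below is an illustrative validator, not code from the paper: the function names are placeholders, the action strings in the example are made up, and the strictness choices (exactly three items, descending confidence) are a literal reading of the quoted prompt text.

```python
import json


def parse_suggestions(raw: str) -> list[dict]:
    """Validate a reply against the quoted format: three {"action", "confidence"} objects."""
    data = json.loads(raw)
    if not isinstance(data, list) or len(data) != 3:
        raise ValueError("expected a JSON array of exactly three suggestions")
    for item in data:
        if not isinstance(item, dict) or set(item) != {"action", "confidence"}:
            raise ValueError('each object must have only the keys "action" and "confidence"')
        if not isinstance(item["action"], str):
            raise ValueError("action must be a string")
        conf = item["confidence"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must be a number between 0.0 and 1.0")
    confidences = [item["confidence"] for item in data]
    if confidences != sorted(confidences, reverse=True):
        raise ValueError("suggestions must be ordered highest confidence first")
    return data


def is_valid_final_answer(answer: str, suggestions: list[dict]) -> bool:
    """The final answer must be a verbatim copy of one suggested action string."""
    return answer in {s["action"] for s in suggestions}


# Hypothetical reply; the action strings are made up for illustration.
reply = (
    '[{"action": "Restore Enterprise1", "confidence": 0.8},'
    ' {"action": "Analyse Enterprise1", "confidence": 0.5},'
    ' {"action": "Monitor", "confidence": 0.2}]'
)
suggestions = parse_suggestions(reply)
print(is_valid_final_answer("Monitor", suggestions))
```

Rejecting malformed replies in code rather than asking the model to re-check its own output is in keeping with the study's finding that programmatic infrastructure tends to be cheaper than added deliberation.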

[1] [1]

Elizabeth Bates, Vasilios Mavroudis, and Chris Hicks. 2023. Reward Shaping for Happier Autonomous Cyber Security Agents. InProceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec ’23)(Copenhagen, Den- mark). Association for Computing Machinery, New York, NY, USA, 221–232. doi:10.1145/3605764.3623916

work page doi:10.1145/3605764.3623916 2023

[2] [2]

CardiffUni Team. 2022. CybORG CAGE-2 Winning Agent: PPO + Greedy Decoys. https://github.com/john-cardiff/-cyborg-cage-2. Accessed: 2026-04-28

work page 2022

[3] [3]

Castro, Roberto Campbell, Nancy Lau, Octavio Villalobos, Jiaqi Duan, and Alvaro A

Sebastián R. Castro, Roberto Campbell, Nancy Lau, Octavio Villalobos, Jiaqi Duan, and Alvaro A. Cardenas. 2025. Large Language Models are Autonomous Cyber Defenders. InProceedings of the 2025 IEEE Conference on Artificial Intelligence (CAI). 1125–1132. doi:10.1109/CAI64502.2025.00195

work page doi:10.1109/cai64502.2025.00195 2025

[4] [4]

Kim Hammar, Neil Dhir, and Rolf Stadler. 2024. Optimal Defender Strate- gies for CAGE-2 using Causal Modeling and Tree Search.arXiv(2024). arXiv:2407.11070 [cs.CR] doi:10.48550/arXiv.2407.11070

work page doi:10.48550/arxiv.2407.11070 2024

[5] [5]

Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. 2024. Large Language Models Cannot Self- Correct Reasoning Yet. InInternational Conference on Learning Representations (ICLR). doi:10.48550/arXiv.2310.01798

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.01798 2024

[6] [6]

context engineering

Andrej Karpathy. 2025. +1 for “context engineering” over “prompt engineering”. X (formerly Twitter) post. https://x.com/karpathy/status/1937902205765607626 Accessed 2026-02-22

work page arXiv 2025

[7] [7]

Karim Ben Khaled and Davy Monticolo. 2026. G2CP: A Graph-Grounded Com- munication Protocol for Verifiable and Efficient Multi-Agent Reasoning.arXiv (2026). arXiv:2602.13370 [cs.AI] doi:10.48550/arXiv.2602.13370

work page doi:10.48550/arxiv.2602.13370 2026

[8] [8]

Mitchell Kiely, David Bowman, Maxwell Standen, and Christopher Moir. 2023. On Autonomous Agents in a Cyber Defence Environment.arXiv(2023). arXiv:2309.07388 [cs.CR] doi:10.48550/arXiv.2309.07388

work page doi:10.48550/arxiv.2309.07388 2023

[9] [9]

Yubin Kim, Ken Gu, Chanwoo Park, Chunjong Park, Samuel Schmidgall, A. Ali Heydari, Yao Yan, Zhihan Zhang, Yuchen Zhuang, Mark Malhotra, Paul Pu Liang, Hae Won Park, Yuzhe Yang, Xuhai Xu, Yilun Du, Shwetak Patel, Tim Althoff, Daniel McDuff, and Xin Liu. 2025. Towards a Science of Scaling Agent Systems. arXiv(2025). arXiv:2512.08296 [cs.AI] doi:10.48550/arX...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2512.08296 2025

[10]

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large Language Models are Zero-Shot Reasoners. In Advances in Neural Information Processing Systems, Vol. 35. doi:10.48550/arXiv.2205.11916

[11]

LangChain. 2025. LangChain. Open-source software framework. https://github.com/langchain-ai/langchain Accessed 2026-02-22

[12]

Duc Huy Le and Rolf Stadler. 2025. Learning Optimal Defender Strategies for CAGE-2 using a POMDP Model. arXiv (2025). arXiv:2509.06539 [cs.AI] doi:10.48550/arXiv.2509.06539

[13]

Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. 2023. Self-Refine: Iterative Refinement with Self-Feedback. In Advances in Neural Information Processi...

[14]

Hamoun Mohammadi, Jonathan J. Davis, and Mitchell Kiely. 2025. Leveraging Large Language Models for Autonomous Cyber Defense: Insights from CAGE-2 Simulations. IEEE Intelligent Systems 40 (2025), 29–36. doi:10.1109/MIS.2025.3568209

[15]

Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A. Smith, and Mike Lewis. 2023. Measuring and Narrowing the Compositionality Gap in Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, 5687–5711. doi:10.18653/v1/2023.findings-emnlp.378

[16]

Chen Qian, Zihao Xie, YiFei Wang, Wei Liu, Kunlun Zhu, Hanchen Xia, Yufan Dang, Zhuoyun Du, Weize Chen, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Scaling Large Language Model-based Multi-Agent Collaboration. In International Conference on Learning Representations (ICLR). doi:10.48550/arXiv.2406.07155

[18]

Matthew Renze and Erhan Guven. 2024. Self-Reflection in LLM Agents: Effects on Problem-Solving Performance. arXiv (2024). arXiv:2405.06682 [cs.AI] doi:10.48550/arXiv.2405.06682

[19]

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. Reflexion: Language Agents with Verbal Reinforcement Learning. In Advances in Neural Information Processing Systems, Vol. 36. doi:10.48550/arXiv.2303.11366

[20]

Maxwell Standen, Martin Lucas, David Bowman, Toby J. Richer, Junae Kim, and Damian Marriott. 2021. CybORG: A Gym for the Development of Autonomous Cyber Agents. arXiv (2021). arXiv:2108.09118 [cs.CR] doi:10.48550/arXiv.2108.09118

[21]

Yichen Tang, Weihang Su, Yujia Zhou, Yiqun Liu, Min Zhang, Shaoping Ma, and Qingyao Ai. 2025. Augmenting Multi-Agent Communication with State Delta Trajectory. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 10219–10240. doi:10.18653/v1/2025.emnlp-main.518

[22]

Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc-Viet Pham, Barry O’Sullivan, and Hoang D. Nguyen. 2025. Multi-Agent Collaboration Mechanisms: A Survey of LLMs. arXiv (2025). arXiv:2501.06322 [cs.AI] doi:10.48550/arXiv.2501.06322

[23]

TTCP CAGE Challenge Working Group. 2022. TTCP CAGE Challenge 2. https://github.com/cage-challenge/cage-challenge-2 Accessed 2026-02-22

[24]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In Advances in Neural Information Processing Systems, Vol. 35. doi:10.48550/arXiv.2201.11903

[25]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. In International Conference on Learning Representations (ICLR). doi:10.48550/arXiv.2210.03629

[26]

Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun. 2025. Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models. arXiv (2025). arXiv:2510.04618 [cs.LG] doi:10.48550/arXiv.2510.04618

Appendix org...

1. REVIEW SITUATION: Check network status and step history. Which hosts need attention? What actions have been tried?
2. IDENTIFY TARGET: Select the most critical host or threat to address this step
3. GATHER INFO: If needed, use get_analysis_of_host_update for detailed analysis of a changed host
4. GET SUGGESTIONS: Call get_suggestion_for_next_action with JSON: {"target_host": "hostname", "situation": "description", "severity": "level", "context": "relevant history"}
5. DECIDE: Choose ONE action from suggestions. You may override based on strategic reasoning rules:
   - You must select ONLY ONE action for your final Answer from the list of suggestions provided by the 'get_suggestion_for_next_action' tool
   - Your final Answer MUST be a verbatim copy of the action-string from ONE of the suggestions
   - TOOLS CANNOT HANDLE MULTIPLE...
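The GET SUGGESTIONS and DECIDE steps above can be illustrated with a minimal Python sketch. It assumes that suggestions arrive as a list of objects with an "action" key; the helper names `build_suggestion_request` and `select_final_answer` and the example action strings are hypothetical and do not come from the paper's implementation.

```python
import json


def build_suggestion_request(target_host, situation, severity, context):
    """Serialize the four fields named in the GET SUGGESTIONS step as JSON."""
    return json.dumps({
        "target_host": target_host,
        "situation": situation,
        "severity": severity,
        "context": context,
    })


def select_final_answer(chosen, suggestions):
    """Return `chosen` only if it is a verbatim copy of one suggested action."""
    allowed = [s["action"] for s in suggestions]
    if chosen not in allowed:
        raise ValueError(f"{chosen!r} is not a verbatim copy of any suggestion")
    return chosen


# Hypothetical inputs, for illustration only.
request = build_suggestion_request(
    "User1", "unexpected process observed", "high", "no action tried yet"
)
suggestions = [
    {"action": "Remove User1", "confidence": 0.8},
    {"action": "Analyse User1", "confidence": 0.5},
]
final_answer = select_final_answer("Remove User1", suggestions)
```

Exact string membership, rather than fuzzy matching, mirrors the prompt's requirement that the final Answer be a verbatim copy of exactly one suggestion.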

1. GET CURRENT STATE: Use get_host_current_state for the target host
2. GET BASELINE: Use get_host_baseline_state to compare against initial state
3. IDENTIFY ANOMALIES: What changed? New processes, connections, missing services?
4. ASSESS SEVERITY: How critical is this compromise? Is there C2 activity?
5. RECOMMEND ACTION: Should we contain, investigate further, or just monitor?

tools:
  - name: "get_host_current_state"
    description: "Get the current state details for a specific host. The input must be a single hostname."
    example_calling: "get_host_current_state: Enterprise1"
  - name: "get_host_baseline_state"
    description: "Get the baseline state details for a...
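The IDENTIFY ANOMALIES step above compares a host's current state with its baseline. The sketch below shows one way to express that comparison as set differences. The state schema used here (dictionaries with "processes", "connections", and "services" lists) and the function name `find_anomalies` are assumptions made for illustration; the listing does not specify the format that get_host_current_state or get_host_baseline_state return.

```python
def find_anomalies(baseline, current):
    """Report new processes, new connections, and missing services."""
    def as_set(state, key):
        # Treat an absent key as an empty collection.
        return set(state.get(key, []))

    return {
        "new_processes": sorted(
            as_set(current, "processes") - as_set(baseline, "processes")
        ),
        "new_connections": sorted(
            as_set(current, "connections") - as_set(baseline, "connections")
        ),
        "missing_services": sorted(
            as_set(baseline, "services") - as_set(current, "services")
        ),
    }


# Hypothetical snapshots, for illustration only.
baseline_state = {"processes": ["sshd"], "connections": [], "services": ["ssh", "http"]}
current_state = {"processes": ["sshd", "cmd.exe"], "connections": ["10.0.0.5:4444"], "services": ["ssh"]}
anomalies = find_anomalies(baseline_state, current_state)
```

A deterministic diff of this kind leaves the severity judgement and the action recommendation to the agent while keeping the list of changes reproducible.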

ACM CAIS ’26, May 26–29, 2026, San Jose, CA, USA. Bogdanov et al.

1. READ SITUATION: Check SITUATION_JSON for target_host, threat description, severity, and context
2. EVALUATE ACTIONS: Consider available action types and their costs vs benefits
3. RANK THREE: Provide three suggestions with confidence scores (0.0-1.0), highest confidence first

answer_format: |
  Your response MUST STRICTLY be a JSON array of objects, where each object represents a suggested action. Each object must have ONLY the following keys: "action", "confidence".

The ActionChooser has no tools; it is a pure generation agent that r...
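The answer format above lends itself to a programmatic check before a suggestion list is passed back to the calling agent. The following is a minimal sketch, assuming the response arrives as a raw JSON string; the function name `validate_suggestions` is hypothetical and is not part of the paper's code.

```python
import json


def validate_suggestions(raw, expected_count=3):
    """Parse an ActionChooser response and enforce the stated answer format."""
    suggestions = json.loads(raw)  # raises ValueError on malformed JSON
    if not isinstance(suggestions, list):
        raise ValueError("response must be a JSON array")
    if len(suggestions) != expected_count:
        raise ValueError(f"expected {expected_count} suggestions")
    for item in suggestions:
        if not isinstance(item, dict) or set(item) != {"action", "confidence"}:
            raise ValueError('each object must have only "action" and "confidence"')
        if not isinstance(item["action"], str):
            raise ValueError("action must be a string")
        conf = item["confidence"]
        if isinstance(conf, bool) or not isinstance(conf, (int, float)):
            raise ValueError("confidence must be a number")
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0.0, 1.0]")
    confidences = [item["confidence"] for item in suggestions]
    if confidences != sorted(confidences, reverse=True):
        raise ValueError("suggestions must be ordered by descending confidence")
    return suggestions


# Hypothetical response, for illustration only.
raw_response = json.dumps([
    {"action": "Restore User1", "confidence": 0.9},
    {"action": "Remove User1", "confidence": 0.6},
    {"action": "Monitor", "confidence": 0.2},
])
ranked = validate_suggestions(raw_response)
```

Rejecting malformed output outright, instead of repairing it, keeps the check simple; whether the original system retries or repairs such responses is not stated in this listing.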