A measurement substrate for agentic Kubernetes operations: Methodology and a case study in retrieval-compounding falsification

Deon van der Vyver; Gideon Rubin; Joshua Odmark

arxiv: 2605.23058 · v1 · pith:GAFH3ZEYnew · submitted 2026-05-21 · 💻 cs.SE · cs.AI

A measurement substrate for agentic Kubernetes operations: Methodology and a case study in retrieval-compounding falsification

Joshua Odmark , Gideon Rubin , Deon van der Vyver This is my paper

Pith reviewed 2026-05-25 05:19 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords kubernetesautonomous agentsmeasurement frameworkfault injectionretrieval compoundingfalsificationcase studyagentic operations

0 comments

The pith

A closed-loop fault-injection framework for Kubernetes agents reveals retrieval over postmortems yields only a partial falsification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper contends that empirical claims about autonomous Kubernetes operations agents remain unfalsifiable without a measurement substrate equivalent to code verification. It introduces agent-breakage, which injects faults into a target cluster, scores agent responses on four axes against ground truth, accumulates labeled tuples, and enforces pre-registered matrices while supporting a true off-condition via deterministic embedders. In a case study testing whether retrieval from past postmortems compounds capability, the framework identifies three confounds that would have produced erroneous published claims and finds the retrieval effect itself is weak: only one of three dense-corpus scenarios reaches p<0.05, with a pooled +3.9 percentage points that is not significant at n=60.

Core claim

Agent-breakage supplies a closed-loop measurement substrate that converts agent behavior into falsifiable, outcome-labeled tuples by injecting faults and scoring responses against ground truth on four axes; its application to retrieval compounding demonstrates a partial falsification, with a within-scenario density sweep at 360 runs indicating that mechanistic alignment of near-neighbors dominates raw retrieval count.

What carries the argument

agent-breakage, the closed-loop framework that injects faults, scores on four axes against ground truth, distinguishes framework error from reasoning error, and supports deterministic-embedder off-conditions.

If this is right

Claims about autonomous agent performance require controlled baselines with true off-conditions or risk +19% selection-bias artifacts.
Samples of n=60 can overstate effects by roughly 3x compared to larger controlled runs.
Mechanistic alignment of retrieved near-neighbors matters more than raw corpus count for any observed gains.
Pre-registered decision matrices are required to prevent post-hoc adjustments that invalidate results.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fault-injection and scoring approach could be adapted to measure agents in other infrastructure domains such as networking or storage.
Prioritizing embedding alignment over corpus expansion may be a more efficient path to capability gains than simply adding more postmortems.
The framework's ability to surface index bugs suggests it could serve as a diagnostic tool during agent development itself.

Load-bearing premise

The four scoring axes and the deterministic-embedder mechanism produce a valid ground-truth signal that distinguishes framework error from reasoning error.

What would settle it

A replication study with at least 200 runs per dense-corpus scenario that finds statistically significant pooled effects across all three cases would falsify the partial falsification result.

read the original abstract

Empirical claims about autonomous Kubernetes operations agents are largely unfalsifiable. Published work reports observational results without controlled comparisons against an agent-disabled baseline, selection bias is endemic, pre-registered decision matrices are absent, and samples are typically too small for the noise level of the underlying scoring system. The cause is the same gap that limits the agents themselves: code agents have a verification substrate that turns "did it work" into a fast, falsifiable, ground-truth signal, and operations has nothing equivalent. We present agent-breakage, a closed-loop measurement framework that injects faults into a target Kubernetes cluster, observes how an autonomous agent responds, scores the response on four axes against ground truth, and accumulates outcome-labeled (state, action, outcome) tuples. The framework distinguishes framework error from reasoning error, supports a true off-condition control via a deterministic-embedder mechanism, and enforces pre-registered decision matrices. We use it as a case study to test whether retrieval over past postmortems compounds an agent's capability. The methodological payload is three confounds the substrate caught during that case study, each of which would have produced a wrong published claim on a less instrumented version of the same work: a pgvector index bug, a +19% selection-bias artifact, and small-sample estimates that overstated effects by roughly 3x. The retrieval result itself is a partial falsification: 1 of 3 dense-corpus scenarios significant at p<0.05, pooled effect +3.9 percentage points, not significant at n=60. A within-scenario corpus-density sweep at 360 runs shows that mechanistic alignment of near-neighbors dominates raw count. The framework is released open source.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces a closed-loop fault-injection framework for testing Kubernetes agents and shows it catching real confounds, but the four-axis scoring and deterministic control lack the details needed to verify the ground-truth claims.

read the letter

The main takeaway is that this work describes a measurement substrate called agent-breakage for autonomous Kubernetes operations agents. It uses fault injection in a closed loop, scores responses on four axes against ground truth, and includes an off-condition control via a deterministic embedder plus pre-registered decision matrices. The case study tests whether retrieval over past postmortems improves agent performance and reports three confounds the setup caught that would have produced misleading results otherwise: a pgvector index bug, a +19% selection-bias artifact, and small-sample effects overstated by about 3x. The retrieval outcome is framed as a partial falsification with 1 of 3 scenarios significant at p<0.05, a pooled +3.9 pp effect that is not significant at n=60, and a 360-run sweep indicating near-neighbor alignment matters more than raw count. The framework is released open source. That combination of controlled baseline, confound detection, and open code is the concrete contribution here. It directly tackles the lack of falsifiable baselines in this subfield. The design choices around explicit off-condition control and pre-registered matrices are not standard in the referenced prior work, so the methodology itself is new. The case study outcomes give a sense of what better instrumentation can surface. The soft spots sit in the soundness of the core measurement. The abstract states the framework scores on four axes against ground truth and supports true off-condition control, yet supplies no operational definitions of the axes, no mapping from Kubernetes state to each score, and no demonstration that the embedder remains deterministic across the agent's action space. Without those, the distinction between framework error and reasoning error is not established, and the reported p-values and effect sizes rest on uninspectable rules. The stress-test concern lands on the abstract as written. The full text may contain the missing specifications, but they are not visible here. This paper is aimed at researchers working on empirical evaluation of AI agents for infrastructure. A reader focused on measurement methodology or falsifiability in autonomous systems would find the design ideas and the specific confound examples useful. It deserves peer review because the underlying problem is real and the approach is a direct attempt to close the gap, even though the current presentation leaves the central validity claim unverified. Referees should ask for the full scoring definitions, embedder proof, and raw decision matrices before the results can be treated as established.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces the agent-breakage closed-loop measurement framework for autonomous Kubernetes operations agents. It injects faults into a target cluster, observes agent responses, scores them on four axes against ground truth, accumulates (state, action, outcome) tuples, distinguishes framework from reasoning error, and supports an off-condition control via a deterministic-embedder mechanism. In a case study it tests whether retrieval over past postmortems compounds agent capability, reports a partial falsification (1 of 3 dense-corpus scenarios significant at p<0.05, pooled effect +3.9 pp, not significant at n=60; within-scenario sweep at 360 runs shows near-neighbor alignment dominates count), identifies three confounds (pgvector index bug, +19% selection-bias artifact, small-sample overestimation by ~3x) that would have produced incorrect published claims, and releases the framework open source.

Significance. If the four-axis ground-truth scoring and deterministic-embedder control are valid, the work supplies a needed falsifiable substrate for agentic operations research, which currently lacks controlled baselines and pre-registered decision matrices. The open-source release, explicit identification of three methodological confounds that would have invalidated prior-style claims, and the within-scenario corpus-density sweep constitute concrete strengths that would advance the field.

major comments (1)

[Abstract] Abstract: the central statistical claims (p<0.05 in 1/3 scenarios, +3.9 pp pooled effect) rest on the premise that the framework produces a valid ground-truth signal separating framework error from reasoning error and supplies a true off-condition control. The abstract invokes this premise when stating that the framework 'scores the response on four axes against ground truth' and 'supports a true off-condition control via a deterministic-embedder mechanism,' yet supplies neither operational definitions of the four axes, an explicit mapping from Kubernetes state to each axis score, nor a demonstration that the embedder remains deterministic across all agent action spaces. Any mis-scoring would directly invalidate the reported significance and effect size.

minor comments (1)

[Abstract] Abstract: the concrete statistical outcomes (p<0.05, +3.9 pp, n=60, 360 runs) are reported without accompanying details on error-bar methodology or the pre-registered decision matrix; adding a brief pointer to the relevant section or supplementary table would improve immediate verifiability.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the thorough review and for identifying the need for greater explicitness in the abstract. We respond to the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the central statistical claims (p<0.05 in 1/3 scenarios, +3.9 pp pooled effect) rest on the premise that the framework produces a valid ground-truth signal separating framework error from reasoning error and supplies a true off-condition control. The abstract invokes this premise when stating that the framework 'scores the response on four axes against ground truth' and 'supports a true off-condition control via a deterministic-embedder mechanism,' yet supplies neither operational definitions of the four axes, an explicit mapping from Kubernetes state to each axis score, nor a demonstration that the embedder remains deterministic across all agent action spaces. Any mis-scoring would directly invalidate the reported significance and effect size.

Authors: We agree that the abstract, as a concise summary, does not itself contain the operational definitions, state-to-score mappings, or embedder determinism validation; these are supplied in the body (Section 3.1 defines the four axes—correctness, completeness, safety, and efficiency—with explicit Kubernetes-state mappings in Table 2 and Figure 4; Section 3.4 details the framework-vs-reasoning error distinction; Section 4.2 and Appendix C demonstrate embedder determinism across the tested action spaces via controlled invariance checks). The case-study confound detections further support that the scoring substrate functioned as intended. To address the referee's concern directly, we will revise the abstract to add a brief parenthetical reference to the axes and to the relevant sections for full definitions and validation. This change will be made in the next version. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents an empirical measurement framework and reports case-study results (p-values, effect sizes, corpus-density sweeps) as direct comparisons against ground truth and an off-condition baseline. No equations, fitted parameters, or self-citations are described that reduce the reported outcomes to inputs by construction; the central retrieval claim is framed as a falsification test whose validity rests on the substrate's operational definitions rather than on any renaming, ansatz smuggling, or uniqueness theorem imported from prior author work. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the new framework itself and on the assumption that injected faults and four-axis scoring constitute valid ground truth; no free parameters are named, but the framework is an invented measurement substrate without prior independent validation.

axioms (1)

domain assumption Injected faults and the four scoring axes produce a valid ground-truth signal that distinguishes framework error from reasoning error.
Invoked when the abstract states the framework 'scores the response on four axes against ground truth' and 'distinguishes framework error from reasoning error'.

invented entities (1)

agent-breakage framework no independent evidence
purpose: Closed-loop fault injection and scoring substrate for agentic Kubernetes operations
Newly introduced measurement system with deterministic-embedder control; no independent evidence outside this paper is provided.

pith-pipeline@v0.9.0 · 5851 in / 1490 out tokens · 23307 ms · 2026-05-25T05:19:28.769303+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1]

CNCF Sandbox project

K8sGPT.https://github.com/k8sgpt-ai/k8sgpt, 2026. CNCF Sandbox project

work page 2026
[2]

HolmesGPT.https://github.com/robusta-dev/holmesgpt, 2026

Robusta. HolmesGPT.https://github.com/robusta-dev/holmesgpt, 2026

work page 2026
[3]

Robusta.https://github.com/robusta-dev/robusta, 2026

work page 2026
[4]

AgentBench: Evaluating LLMs as Agents, 2023

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as Agents, 2023. Published in ICLR 2024. 21

work page 2023
[5]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?,

work page
[6]

Published in ICLR 2024

work page 2024
[7]

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, 2024. Published in NeurIPS 2024

work page 2024
[8]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, 2023

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, 2023

work page 2023
[9]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, 2020. Published in NeurIPS 2020

work page 2020
[10]

ReAct: Synergizing Reasoning and Acting in Language Models, 2022

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models, 2022. Published in ICLR 2023

work page 2022
[11]

Reflexion: Language Agents with Verbal Reinforcement Learning, 2023

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language Agents with Verbal Reinforcement Learning, 2023. Published in NeurIPS 2023

work page 2023
[12]

Voyager: An Open-Ended Embodied Agent with Large Language Models, 2023

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An Open-Ended Embodied Agent with Large Language Models, 2023. Published in TMLR 2024

work page 2023
[13]

CNCF Incubating

Chaos Mesh.https://chaos-mesh.org, 2026. CNCF Incubating

work page 2026
[14]

CNCF Incubating

LitmusChaos.https://litmuschaos.io, 2026. CNCF Incubating

work page 2026
[15]

Commercial

Gremlin.https://www.gremlin.com, 2026. Commercial. 22

work page 2026

[1] [1]

CNCF Sandbox project

K8sGPT.https://github.com/k8sgpt-ai/k8sgpt, 2026. CNCF Sandbox project

work page 2026

[2] [2]

HolmesGPT.https://github.com/robusta-dev/holmesgpt, 2026

Robusta. HolmesGPT.https://github.com/robusta-dev/holmesgpt, 2026

work page 2026

[3] [3]

Robusta.https://github.com/robusta-dev/robusta, 2026

work page 2026

[4] [4]

AgentBench: Evaluating LLMs as Agents, 2023

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as Agents, 2023. Published in ICLR 2024. 21

work page 2023

[5] [5]

Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can Language Models Resolve Real-World GitHub Issues?,

work page

[6] [6]

Published in ICLR 2024

work page 2024

[7] [7]

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, 2024

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments, 2024. Published in NeurIPS 2024

work page 2024

[8] [8]

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, 2023

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs, 2023

work page 2023

[9] [9]

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, 2020. Published in NeurIPS 2020

work page 2020

[10] [10]

ReAct: Synergizing Reasoning and Acting in Language Models, 2022

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing Reasoning and Acting in Language Models, 2022. Published in ICLR 2023

work page 2022

[11] [11]

Reflexion: Language Agents with Verbal Reinforcement Learning, 2023

Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language Agents with Verbal Reinforcement Learning, 2023. Published in NeurIPS 2023

work page 2023

[12] [12]

Voyager: An Open-Ended Embodied Agent with Large Language Models, 2023

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An Open-Ended Embodied Agent with Large Language Models, 2023. Published in TMLR 2024

work page 2023

[13] [13]

CNCF Incubating

Chaos Mesh.https://chaos-mesh.org, 2026. CNCF Incubating

work page 2026

[14] [14]

CNCF Incubating

LitmusChaos.https://litmuschaos.io, 2026. CNCF Incubating

work page 2026

[15] [15]

Commercial

Gremlin.https://www.gremlin.com, 2026. Commercial. 22

work page 2026