pith. machine review for the scientific record.

arxiv: 2604.20938 · v1 · submitted 2026-04-22 · 💻 cs.LG · cs.AI

Recognition: unknown

HARBOR: Automated Harness Optimization

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:37 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI
keywords harness optimization · Bayesian optimization · language model agents · automated configuration search · agent development · noisy optimization · multi-fidelity methods · coding agents

The pith

Treating language-model agent harness design as a machine-learning optimization problem allows automated search to outperform manual tuning for large flag spaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that the harness wrapping a language model agent, which includes context compaction, tool caching, and memory mechanisms, is the dominant source of complexity and should be optimized automatically rather than by hand. It formalizes this as a constrained noisy Bayesian optimization problem over mixed variables with cost and safety considerations, and introduces the HARBOR solver to perform the search. In a case study with a production coding agent, the automated run surpasses a controlled manual-tuning effort on a fixed task suite. If true, this shifts agent development from iterative human engineering to systematic search, making performance gains more reproducible across different base models. The method is general for any bounded-flag agent setup with reproducible tasks.
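
To make "mixed-variable, cost-heterogeneous" concrete, here is a minimal sketch of what such a flag space could look like. The flag names, domains, and cost figures are hypothetical illustrations, not the paper's actual configuration space.

```python
# Hypothetical harness flag space: mixed types (booleans, categoricals,
# discretized reals) with heterogeneous per-evaluation costs.
from dataclasses import dataclass

@dataclass(frozen=True)
class Flag:
    name: str
    domain: tuple          # finite set of admissible values
    eval_cost: float       # extra cost of evaluating a config that enables it

FLAGS = [
    Flag("context_compaction", (False, True), eval_cost=1.0),
    Flag("tool_cache",         (False, True), eval_cost=0.2),
    Flag("memory_backend",     ("none", "episodic", "semantic"), eval_cost=1.5),
    Flag("compaction_ratio",   (0.25, 0.5, 0.75), eval_cost=0.0),
]

def cardinality(flags):
    n = 1
    for f in flags:
        n *= len(f.domain)
    return n

def config_cost(config, flags, base=1.0):
    # Cost heterogeneity: different settings make one noisy evaluation
    # (a pass over the task suite) cheaper or more expensive.
    return base + sum(f.eval_cost for f in flags
                      if config[f.name] not in (False, "none"))

cfg = {"context_compaction": True, "tool_cache": True,
       "memory_backend": "semantic", "compaction_ratio": 0.5}
print(cardinality(FLAGS), config_cost(cfg, FLAGS))  # 36 configurations; cost 3.7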

Core claim

Harness design is a first-class machine-learning problem. Automated configuration search using constrained noisy Bayesian optimization dominates manual stacking once the flag space exceeds a handful of bits. We provide the HARBOR reference solver based on a block-additive SAAS surrogate, multi-fidelity cost-aware acquisition, and TuRBO trust regions, and validate it in an end-to-end run against manual tuning on a coding-agent task suite.

What carries the argument

The HARBOR solver, which performs constrained noisy Bayesian optimization over a mixed-variable, cost-heterogeneous configuration space using a block-additive SAAS surrogate, multi-fidelity acquisition, and trust regions, with cold-start-corrected rewards and posterior chance-constrained safety checks.
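
As a sketch of the surrogate side of that machinery, the GPyTorch fragment below builds a block-additive GP (one kernel per flag block, summed) and a simple expected-improvement-per-unit-cost acquisition. The SAAS sparsity prior, trust regions, and HARBOR's actual acquisition are not specified in enough detail here to reproduce, so treat this as an illustrative skeleton under those assumptions.

```python
import torch
import gpytorch
from torch.distributions import Normal

class BlockAdditiveGP(gpytorch.models.ExactGP):
    """GP whose kernel is a sum over flag blocks, one RBF kernel per block."""
    def __init__(self, train_x, train_y, blocks):
        likelihood = gpytorch.likelihoods.GaussianLikelihood()
        super().__init__(train_x, train_y, likelihood)
        self.mean_module = gpytorch.means.ConstantMean()
        kernel = None
        for dims in blocks:  # e.g. blocks = [(0, 1), (2, 3), (4, 5)]
            k = gpytorch.kernels.ScaleKernel(
                gpytorch.kernels.RBFKernel(active_dims=dims))
            kernel = k if kernel is None else kernel + k
        self.covar_module = kernel

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

def ei_per_unit_cost(mu, sigma, best_so_far, cost):
    """Cost-aware acquisition: expected improvement divided by evaluation cost."""
    std_normal = Normal(torch.zeros_like(mu), torch.ones_like(sigma))
    z = (mu - best_so_far) / sigma
    ei = sigma * (z * std_normal.cdf(z) + std_normal.log_prob(z).exp())
    return ei / cost

# Example: a 6-dimensional flag encoding split into three blocks.
X, y = torch.rand(10, 6), torch.rand(10)
model = BlockAdditiveGP(X, y, blocks=[(0, 1), (2, 3), (4, 5)])
```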

If this is right

  • The automated method will locate higher-reward harness configurations than manual tuning when the number of flags grows beyond a small set.
  • The same optimization framework applies to any agent harness with a bounded flag space and reproducible task suite.
  • Cost-aware acquisition and safety checks keep the search efficient and prevent unsafe states during optimization.
  • Cold-start corrections improve reward estimation for configurations that have not been fully evaluated (one plausible form of such a correction is sketched after this list).
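
The paper's exact cold-start correction is not reproduced in this review. As one plausible reading, the sketch below simply drops the first few warm-up episodes of a freshly started harness (empty caches, unpopulated memory) before averaging reward, so steady-state configurations are not penalized for transient startup misses.

```python
# Assumed correction: discard warm-up episodes before estimating reward.
def cold_start_corrected_reward(episode_rewards, warmup=3):
    steady = episode_rewards[warmup:]
    if not steady:                  # config never reached steady state:
        return None                 # treat the estimate as unavailable
    return sum(steady) / len(steady)

# Example: a tool-caching config that starts cold but converges.
print(cold_start_corrected_reward([0.2, 0.4, 0.7, 0.8, 0.8, 0.9]))  # 0.833...
```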

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent developers could integrate this optimization into training loops to continuously refine harnesses alongside model updates.
  • Similar automated approaches might improve configuration of other complex systems like reinforcement learning environments or data processing pipelines.
  • If successful, it reduces reliance on expert prompt engineering and harness crafting, potentially democratizing high-performance agent design.

Load-bearing premise

The configuration space remains bounded and the task suite is fully reproducible, allowing the optimization procedure to locate high-reward configurations without prohibitive cost or unsafe states.
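
The "handful of bits" threshold in the core claim is easy to quantify with hypothetical counts: with b binary flags the bounded space holds 2^b configurations, which outruns a four-round manual protocol almost immediately.

```python
# Worked arithmetic for the "handful of bits" threshold (hypothetical counts).
for b in (4, 8, 12, 16, 20):
    print(f"{b:>2} binary flags -> {2 ** b:>7} configurations")
# 4 -> 16, ..., 20 -> 1048576: exhaustive manual comparison is hopeless well
# before 20 flags, the regime where automated search is claimed to dominate.
```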

What would settle it

A direct comparison on the same production coding agent and task suite where the best manual configuration after equivalent evaluation effort matches or exceeds the HARBOR result would falsify the claim that automated search dominates.
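
A minimal sketch of that falsification test, assuming only that each arm logs (cost, reward) pairs per configuration tried; the names and bookkeeping are hypothetical.

```python
# Budget-matched comparison: both arms spend the same total evaluation cost
# on the same frozen task suite; the claim fails if the manual arm's best
# reward matches or exceeds HARBOR's.
def best_under_budget(evaluations, budget):
    spent, best = 0.0, float("-inf")
    for cost, reward in evaluations:   # (cost, measured reward) per config
        if spent + cost > budget:
            break
        spent += cost
        best = max(best, reward)
    return best

def automated_search_dominates(manual_evals, harbor_evals, budget):
    return (best_under_budget(harbor_evals, budget)
            > best_under_budget(manual_evals, budget))

manual = [(3.0, 0.61), (3.0, 0.64), (3.0, 0.70), (3.0, 0.69)]   # four rounds
harbor = [(1.0, 0.55), (0.5, 0.58), (2.5, 0.66), (3.0, 0.74), (3.0, 0.71)]
print(automated_search_dominates(manual, harbor, budget=12.0))  # True here
```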

Figures

Figures reproduced from arXiv:2604.20938 by Biswa Sengupta and Jinhua Wang.

Figure 1. HARBOR schematic. A flag-space × fidelity input (x, m) is scored by a block-additive SAAS surrogate; a cost-aware acquisition selects the next batch; runtime telemetry (dashed) feeds Axis-IV silent-flag auto-exclusion, which prunes dimensions from the surrogate in place. The output is a Pareto front on (pass-rate, cost), from which a single deployment configuration is picked subject to the posterior chance… view at source ↗
Figure 2. Reference HARBOR solver for harness configuration. μ(c), σ(c) denote the GP posterior mean and standard deviation; T_m is the fidelity-m task subset; α_ℓ² is the per-block kernel scale of Eq. 6; n_ℓ is the number of distinct block-ℓ projections seen in history H. […] signal; the rest are either silent (Axis-IV detector, §VII-0g) or dominated by telemetry-level integration failures. SAAS concentrates a half-Cauchy p… view at source ↗
read the original abstract

Long-horizon language-model agents are dominated, in lines of code and in operational complexity, not by their underlying model but by the harness that wraps it: context compaction, tool caching, semantic memory, trajectory reuse, speculative tool prediction, and the glue that binds the model to a sandboxed execution environment. We argue that harness design is a first-class machine-learning problem and that automated configuration search dominates manual stacking once the flag space exceeds a handful of bits. We defend this claim in two steps. First, we formalize automated harness optimization as constrained noisy Bayesian optimization over a mixed-variable, cost-heterogeneous configuration space with cold-start-corrected rewards and a posterior chance-constrained safety check, and give a reference solver, HARBOR (Harness Axis-aligned Regularized Bayesian Optimization Routine), built from a block-additive SAAS surrogate, multi-fidelity cost-aware acquisition, and TuRBO trust regions. Second, we instantiate the problem in a flag-gated harness over a production coding agent and report a controlled four-round manual-tuning case study against a fixed task suite and an end-to-end HARBOR run. The formulation itself is task-class agnostic: the configuration space, reward correction, acquisition, and safety check apply to any agent harness with a bounded flag space and a reproducible task suite.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that harness design for long-horizon language-model agents is a first-class machine-learning problem best solved by automated configuration search. It formalizes harness optimization as constrained noisy Bayesian optimization over a mixed-variable, cost-heterogeneous space with cold-start-corrected rewards and a posterior chance-constrained safety check, introduces the HARBOR solver (block-additive SAAS surrogate, multi-fidelity cost-aware acquisition, TuRBO trust regions), and supports the claim via a controlled case study contrasting four-round manual tuning against an end-to-end HARBOR run on a production coding-agent harness with a fixed task suite. The formulation is presented as task-class agnostic for any bounded flag space and reproducible task suite.

Significance. If the case study were augmented with quantitative metrics showing clear outperformance in flag spaces beyond a small number of bits, along with reproducibility details, the work could meaningfully shift agent development practices toward systematic optimization rather than manual stacking. The task-agnostic formalization, reference solver components, and emphasis on safety and cost heterogeneity are positive contributions that could be adopted if empirically grounded.

major comments (2)
  1. [case study] Case study description (and abstract): the central claim that automated search dominates manual stacking once the flag space exceeds a handful of bits cannot be assessed because the manuscript supplies neither the exact cardinality nor bit-width of the flag space in the production coding-agent harness, nor evidence that the four-round manual baseline was driven to convergence or compared against alternative manual strategies.
  2. [abstract and case study] Abstract and case study: the soundness of the empirical claim is undermined by the absence of any quantitative results, error bars, reward values, convergence data, or detailed validation metrics comparing HARBOR to the manual baseline, leaving the performance advantage unmeasurable.
minor comments (1)
  1. [formalization] The description of the posterior chance-constrained safety check would benefit from an explicit equation or pseudocode to clarify its integration with the acquisition function; a generic form such a check could take is sketched below.
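
For concreteness, one standard form such a check could take (this is a generic posterior chance constraint, not the paper's own equation): admit configuration c only if P(g(c) ≤ τ | data) ≥ 1 − δ under the GP posterior for a safety metric g.

```python
# Generic posterior chance constraint with a Gaussian posterior on g(c).
from scipy.stats import norm

def passes_chance_constraint(mu_g, sigma_g, threshold, delta=0.05):
    """P(g(c) <= threshold | data) >= 1 - delta, with g(c) ~ N(mu_g, sigma_g^2)."""
    return norm.cdf((threshold - mu_g) / sigma_g) >= 1.0 - delta

# Example: posterior mean 0.8 units below the threshold, posterior std 0.5
# -> P ~ 0.945 < 0.95, so the config is rejected at delta = 0.05.
print(passes_chance_constraint(mu_g=1.2, sigma_g=0.5, threshold=2.0))  # False
```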

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to strengthen the empirical details and reproducibility of the case study.

read point-by-point responses
  1. Referee: [case study] Case study description (and abstract): the central claim that automated search dominates manual stacking once the flag space exceeds a handful of bits cannot be assessed because the manuscript supplies neither the exact cardinality nor bit-width of the flag space in the production coding-agent harness, nor evidence that the four-round manual baseline was driven to convergence or compared against alternative manual strategies.

    Authors: We agree that the exact cardinality and bit-width of the flag space must be stated explicitly for the central claim to be assessable. The revised manuscript will report the precise configuration space of the production coding-agent harness, including the number of flags, their types and domains, and the resulting search-space cardinality. We will also expand the case-study description to detail the manual-tuning protocol, including the specific steps taken across the four rounds, observed performance trends that indicate diminishing returns, and a brief discussion of why alternative manual strategies were not exhaustively benchmarked. revision: yes

  2. Referee: [abstract and case study] Abstract and case study: the soundness of the empirical claim is undermined by the absence of any quantitative results, error bars, reward values, convergence data, or detailed validation metrics comparing HARBOR to the manual baseline, leaving the performance advantage unmeasurable.

    Authors: We concur that quantitative metrics are required to substantiate the performance advantage. The revised version will augment both the abstract and the case-study section with the missing numerical results, including per-round reward values, standard-error bars where multiple runs are available, convergence curves, and explicit validation metrics that directly compare the HARBOR run against the manual baseline on the fixed task suite. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper formalizes harness optimization as constrained noisy Bayesian optimization with a block-additive SAAS surrogate and multi-fidelity acquisition, then reports an empirical case study contrasting four-round manual tuning against an end-to-end HARBOR run on a fixed task suite. No equations, predictions, or first-principles results reduce by construction to fitted inputs or self-citations; the configuration space, reward correction, and safety check are defined independently of the specific case-study outcomes. The dominance claim is defended by the case study rather than by re-deriving the same quantities from the optimization itself. The formulation is explicitly task-agnostic and does not rely on load-bearing self-citations or ansatzes smuggled from prior author work. This is the normal, non-circular outcome for a methods-plus-case-study paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that harness configuration can be treated as a bounded, noisy, mixed-variable optimization problem amenable to Bayesian methods.

axioms (1)
  • domain assumption: Harness configuration space is mixed-variable, cost-heterogeneous, and bounded, with a reproducible task suite.
    Invoked when formalizing the problem and when reporting the case study.

pith-pipeline@v0.9.0 · 5518 in / 1117 out tokens · 25466 ms · 2026-05-10T00:37:02.699336+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Agentic-imodels: Evolving agentic interpretability tools via autoresearch

    cs.AI · 2026-05 · unverdicted · novelty 7.0

    Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

Reference graph

Works this paper leans on

43 extracted references · 12 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1] Mikek, B., Vashchilenko, D., Lu, B., and Xu, P. Agentic Code Optimization via Compiler-LLM Cooperation. arXiv preprint arXiv:2604.04238, 2026.
  2. [2] Kang, M., Chen, W.-N., Han, D., Inan, H. A., Wutschitz, L., Chen, Y., Sim, R., and Rajmohan, S. ACON: Optimizing Context Compression for Long-horizon LLM Agents. arXiv preprint arXiv:2510.00615, 2025.
  3. [3] Gupta, A. 2025 Was Agents. 2026 Is Agent Harnesses. Medium, 2026.
  4. [4] Balandat, M., Karrer, B., Jiang, D. R., Daulton, S., Letham, B., Wilson, A. G., and Bakshy, E. BoTorch: A Framework for Efficient Monte-Carlo Bayesian Optimization. NeurIPS, 2020.
  5. [5] Datadog Engineering. Closing the Verification Loop: Observability-Driven Harnesses for Building with Agents (BitsEvolve). Datadog Engineering Blog, 2026.
  6. [6] Su, J., Lan, Q., Xia, Y., Sun, L., Tian, W., Shi, T., Song, X., He, L., and Jingsong, Y. Difficulty-Aware Agentic Orchestration for Query-Specific Multi-Agent Workflows. arXiv preprint arXiv:2509.11079, 2025.
  7. [7] Liu, J., Zhao, X., Shang, X., and Shen, Z. Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems. arXiv preprint arXiv:2604.14228, 2026.
  8. [8] Atlan Engineering. RAGAS, TruLens, DeepEval: LLM Evaluation Frameworks Compared. Atlan Engineering Blog, 2026.
  9. [9] Eriksson, D., Pearce, M., Gardner, J. R., Turner, R. D., and Poloczek, M. Scalable Global Optimization via Local Bayesian Optimization. NeurIPS, 2019.
  10. [10] Eriksson, D. and Jankowiak, M. High-Dimensional Bayesian Optimization with Sparse Axis-Aligned Subspaces. UAI, 2021.
  11. [11] Daulton, S., Wan, X., Eriksson, D., Balandat, M., Osborne, M. A., and Bakshy, E. Bayesian Optimization over Discrete and Mixed Spaces via Probabilistic Reparameterization. NeurIPS, 2022.
  12. [12] Garrido-Merchán, E. C. and Hernández-Lobato, D. Dealing with Categorical and Integer-Valued Variables in Bayesian Optimization with Gaussian Processes. Neurocomputing, 380:20–35, 2020.
  13. [13] Wu, J., Toscano-Palmerin, S., Frazier, P. I., and Wilson, A. G. Practical Multi-fidelity Bayesian Optimization for Hyperparameter Tuning. UAI, 2020.
  14. [14] Duvenaud, D., Nickisch, H., and Rasmussen, C. E. Additive Gaussian Processes. NIPS, 2011.
  15. [15] Golovin, D., Solnik, B., Moitra, S., Kochanski, G., Karro, J., and Sculley, D. Google Vizier: A Service for Black-Box Optimization. KDD, 2017.
  16. [16] EleutherAI. LM Evaluation Harness. GitHub, EleutherAI/lm-evaluation-harness, 2024.
  17. [17] Falkner, S., Klein, A., and Hutter, F. BOHB: Robust and Efficient Hyperparameter Optimization at Scale. ICML, 2018.
  18. [18] Ding, D., Mallick, A., Wang, C., Sim, R., Mukherjee, S., Rühle, V., Lakshmanan, L. V. S., and Awadallah, A. H. Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing. ICLR, 2024.
  19. [19] Lopopolo, R. Harness Engineering: Leveraging Codex in an Agent-First World. OpenAI Research Blog, 2026.
  20. [20] Shakir, A., Aarsen, T., and Lee, S. Binary and Scalar Embedding Quantization for Significantly Faster and Cheaper Retrieval. Hugging Face Blog, 2024.
  21. [21] Jimenez, C., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? ICLR, 2024.
  22. [22] Khot, T., Trivedi, H., Finlayson, M., Fu, Y., Richardson, K., Clark, P., and Sabharwal, A. Decomposed Prompting: A Modular Approach for Solving Complex Tasks. ICLR, 2023.
  23. [23] Kusupati, A., Bhatt, G., Rege, A., Wallingford, M., Sinha, A., Ramanujan, V., Howard-Snyder, W., Chen, K., Kakade, S., Jain, P., and Farhadi, A. Matryoshka Representation Learning. NeurIPS, 2022.
  24. [24] Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., and Talwalkar, A. Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. JMLR, 18(185):1–52, 2018.
  25. [25] LMCache Team. Context Engineering Reuse Patterns: Under the Hood of Claude Code. LMCache Engineering Blog, 2025.
  26. [26] Lee, Y., Nair, R., Zhang, Q., Lee, K., Khattab, O., and Finn, C. Meta-Harness: End-to-End Optimization of Model Harnesses. arXiv preprint arXiv:2603.28052, 2026.
  27. [27] OpenTelemetry Specification Working Group. Semantic Conventions for Generative AI Systems (v1.37). OpenTelemetry Specification, 2026.
  28. [28] Hu, Z., Pan, Z., Kaur, P., Murthy, V., Yu, Z., Guan, Y., Wang, Z., Swanson, S., and Ding, Y. Pancake: Hierarchical Memory System for Multi-Agent LLM Serving. arXiv preprint arXiv:2602.21477, 2026.
  29. [29] Sui, Y., Zhao, H., Ma, R., He, Z., Wang, H., Li, J., and Yang, Y. Act While Thinking: Accelerating LLM Agents via Pattern-Aware Speculative Tool Execution. arXiv preprint arXiv:2603.18897, 2026.
  30. [30] Ram, S., Zhang, S., Gong, J., and Roth, D. On the Optimality Gap of Warm-Started Hyperparameter Optimization. Transactions on Machine Learning Research (TMLR), 2022.
  31. [31] Sarthi, P., Abdullah, S., Tuli, A., Khanna, S., Goldie, A., and Manning, C. D. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. ICLR, 2024.
  32. [32] Liu, X., Atalar, B., Dai, X., Zuo, J., Wang, S., Lui, J. C. S., Chen, W., and Joe-Wong, C. Semantic Caching for Low-Cost LLM Serving: From Offline Learning to Online Adaptation. Proceedings of INFOCOM, 2026.
  33. [33] Schreiter, J., Nguyen-Tuong, D., and Toussaint, M. Safe Risk-Averse Bayesian Optimization for Controller Tuning. arXiv preprint arXiv:2306.13479, 2023.
  34. [34] Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS, 2023.
  35. [35] Wang, H., Feng, Y., Cao, Y., Xie, X., and Zhou, S. K. SkewRoute: Training-Free LLM Routing for Knowledge Graph Retrieval-Augmented Generation via Score Skewness of Retrieved Context. arXiv preprint arXiv:2505.23841, 2025.
  36. [36] Snoek, J., Larochelle, H., and Adams, R. P. Practical Bayesian Optimization of Machine Learning Algorithms. NeurIPS, 2012.
  37. [37] Swersky, K., Snoek, J., and Adams, R. P. Freeze-Thaw Bayesian Optimization. arXiv preprint arXiv:1406.3896, 2014.
  38. [38] Laude Institute and Stanford University. Terminal-Bench 2.0: A Benchmark for Agents in Terminal Environments. Open-source benchmark; https://www.tbench.ai/, 2026.
  39. [39] Krafton AI Research. How We Reached 74.8% on Terminal-Bench with Terminus-KIRA: Harness Fixes That Matter. Engineering Blog, 2026.
  40. [40] Wang, J. and Sengupta, B. From Translation to Superset: Benchmark-Driven Evolution of a Production AI Agent from Rust to Python. arXiv preprint arXiv:2604.11518, 2026.
  41. [41] Watershed Engineering. A Practical Framework for LLM System Evaluations for Multi-Step Processes. Watershed Engineering Blog, 2026.
  42. [42] Wilson, E. B. Probable Inference, the Law of Succession, and Statistical Inference. Journal of the American Statistical Association, 22(158):209–212, 1927.
  43. [43] Zandieh, A., Daliri, M., Hadian, M., and Mirrokni, V. TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate. arXiv preprint arXiv:2504.19874, 2025.