arxiv: 2605.23950 · v1 · pith:7JLDUN4Fnew · submitted 2026-05-07 · 💻 cs.AI · cs.SE

Stop Comparing LLM Agents Without Disclosing the Harness

Pith reviewed 2026-06-30 23:13 UTC · model grok-4.3

classification 💻 cs.AI cs.SE
keywords LLM agents · evaluation harness · performance variance · Binding Constraint Thesis · long-horizon tasks · agent benchmarks · evaluation protocols

The pith

The execution harness often determines LLM agent performance more than the model it wraps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This position paper claims that for long-horizon tasks evaluated across models of comparable frontier capability, the agent execution harness—the infrastructure handling context construction, tool interaction, orchestration, and verification—frequently exerts a stronger influence on outcomes than the language model itself. It formalizes this observation as the Binding Constraint Thesis, under which performance variance stems more from harness configuration choices than from model selection, causing evaluations to credit models for gains that actually arise at the harness level. Support comes from a control-theoretic framing of the harness as controller and the model as policy, from empirical variance decompositions across benchmarks and deployments that show harness effects can exceed and even reverse model effects, and from a proposed disclosure standard plus decomposition protocol. A sympathetic reader would care because the thesis implies that many existing agent comparisons isolate neither model nor harness effects cleanly.

Core claim

The Binding Constraint Thesis states that, for long-horizon tasks with comparable frontier models, performance variance is governed more by harness configuration than by model choice, and that current evaluation protocols therefore systematically misattribute harness-level gains to model improvements. The thesis is defended in three ways: by treating the harness as the controller of a closed-loop dynamical system whose LLM component functions as the governed stochastic policy; by showing through benchmark analysis and variance decomposition that harness-induced variance can substantially exceed model-induced variance, including cases of ranking reversal; and by outlining a harness-aware evaluation framework that pairs a disclosure standard with a variance decomposition protocol.

What carries the argument

The Binding Constraint Thesis, which treats the harness as controller of a closed-loop dynamical system and the LLM as the stochastic policy it governs.
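
To make the controller/policy split concrete, here is a minimal toy sketch in Python. It is not the paper's formalization: the per-step success probabilities, the retry rule, and all function names (`policy_step`, `run_episode`, `success_rate`) are assumptions chosen for illustration. The "model" is a stochastic policy that gets each step right with a fixed probability, and the "harness" either passes actions through unchecked (open loop) or verifies each step and retries on failure (closed loop).

```python
import random


def policy_step(p_correct: float, rng: random.Random) -> bool:
    """Toy stochastic policy: one action, correct with probability p_correct."""
    return rng.random() < p_correct


def run_episode(p_correct: float, horizon: int, max_retries: int,
                rng: random.Random) -> bool:
    """Toy harness loop: verify every step and retry up to max_retries times.

    With max_retries=0 the harness never corrects the policy (open loop).
    """
    for _ in range(horizon):
        ok = policy_step(p_correct, rng)
        retries = 0
        while not ok and retries < max_retries:
            ok = policy_step(p_correct, rng)
            retries += 1
        if not ok:
            return False  # an uncorrected error fails the whole task
    return True


def success_rate(p_correct: float, horizon: int, max_retries: int,
                 episodes: int = 2000, seed: int = 0) -> float:
    """Monte Carlo estimate of the task success rate."""
    rng = random.Random(seed)
    wins = sum(run_episode(p_correct, horizon, max_retries, rng)
               for _ in range(episodes))
    return wins / episodes


# Hypothetical settings: a 20-step task, a weaker and a stronger policy.
weak_open = success_rate(0.90, horizon=20, max_retries=0)
strong_open = success_rate(0.95, horizon=20, max_retries=0)
weak_closed = success_rate(0.90, horizon=20, max_retries=2)
print(weak_open, strong_open, weak_closed)
```

Under these assumptions the analytic success rates are roughly 0.9^20 ≈ 0.12 for the weaker policy in open loop, 0.95^20 ≈ 0.36 for the stronger policy in open loop, and (1 − 0.1^3)^20 ≈ 0.98 for the weaker policy with two verified retries. In this toy setting, adding verification to the harness moves the outcome more than swapping the policy does; whether real agents behave this way is exactly what the paper's empirical evidence has to establish.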

If this is right

  • Current leaderboard comparisons of long-horizon agents are incomplete and potentially misleading without harness disclosure.
  • Harness configuration changes can produce performance shifts larger than those obtained by substituting one model for another.
  • Model rankings can reverse when the same models are evaluated under different harnesses.
  • A standardized disclosure requirement and variance decomposition protocol are needed to isolate model contributions.
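
A ranking reversal of the kind described in the third bullet can be checked mechanically once scores are available for the same models under more than one harness. The sketch below is illustrative only: the score table is invented, and the names `scores` and `ranking_reversals` are hypothetical, not taken from the paper.

```python
from itertools import combinations

# Hypothetical success rates, invented for illustration (not from the paper).
scores = {
    "model_a": {"harness_1": 0.42, "harness_2": 0.61},
    "model_b": {"harness_1": 0.47, "harness_2": 0.55},
}


def ranking_reversals(scores):
    """List (model_x, model_y, harness_p, harness_q) where the model order flips.

    A flip means model_x beats model_y under one harness and loses under the other.
    """
    models = sorted(scores)
    harnesses = sorted(next(iter(scores.values())))
    reversals = []
    for x, y in combinations(models, 2):
        for p, q in combinations(harnesses, 2):
            gap_p = scores[x][p] - scores[y][p]
            gap_q = scores[x][q] - scores[y][q]
            if gap_p * gap_q < 0:  # opposite signs: the ranking reversed
                reversals.append((x, y, p, q))
    return reversals


print(ranking_reversals(scores))
# [('model_a', 'model_b', 'harness_1', 'harness_2')]
```

Ties (a gap of exactly zero) are not counted as reversals here; a real analysis would also need confidence intervals before calling a flip meaningful.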

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent research may benefit more from systematic harness tuning than from repeated model swaps in the current regime.
  • The same confounding pattern could appear in evaluations of other composite AI systems where supporting layers are not reported.
  • Requiring harness disclosure would allow future meta-analyses to quantify the relative contribution of infrastructure versus model across published results.

Load-bearing premise

The control-theoretic view that treats the harness as controller and the LLM as policy accurately explains why small harness changes can exceed model substitution effects.

What would settle it

A controlled study that fixes one harness across several frontier models and measures model-induced variance, then fixes one model across several harness configurations and measures harness-induced variance, showing the latter does not exceed the former.
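
One way to set up such a comparison is a balanced model × harness grid, which generalizes the "fix one, vary the other" design described above. The following sketch is an assumption-laden illustration, not the paper's protocol (whose details are not reproduced on this page): the grid values are invented, and `decompose` is a hypothetical helper that splits total variance into a model main effect, a harness main effect, and a remainder.

```python
from statistics import mean, pvariance

# Hypothetical success rates (rows: models, columns: harness configurations).
grid = {
    "model_a": {"h1": 0.40, "h2": 0.55, "h3": 0.70},
    "model_b": {"h1": 0.44, "h2": 0.57, "h3": 0.73},
    "model_c": {"h1": 0.42, "h2": 0.59, "h3": 0.68},
}


def decompose(grid):
    """Split total variance of a balanced grid with one score per cell.

    For a balanced layout, total population variance equals the variance of
    the model means plus the variance of the harness means plus a remainder
    (interaction and noise, which cannot be separated without replicates).
    """
    models = list(grid)
    harnesses = list(next(iter(grid.values())))
    cells = [grid[m][h] for m in models for h in harnesses]
    model_means = [mean(grid[m][h] for h in harnesses) for m in models]
    harness_means = [mean(grid[m][h] for m in models) for h in harnesses]
    total = pvariance(cells)
    var_model = pvariance(model_means)
    var_harness = pvariance(harness_means)
    return {
        "model": var_model,
        "harness": var_harness,
        "remainder": total - var_model - var_harness,
        "total": total,
    }


parts = decompose(grid)
for name, value in parts.items():
    print(f"{name:>9}: {value:.5f}")
```

On this invented grid the harness component dominates, which is the pattern the thesis predicts; the settling experiment would be a real grid in which the harness component does not exceed the model component.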

Figures

Figures reproduced from arXiv: 2605.23950 by Chandan K. Reddy, Janet Wang, Jihun Hamm, Weijie Xu, Yingqiang Ge, Yunbei Zhang.

Figure 1: From inference to control. The inference framing (left) treats the agent as a model in a while-loop, attributing performance to π_θ. The closed-loop framing (right) makes the harness the controller C_H: stability V(s_t), context drift δ_t, and control lag τ are controller properties, the model is open-loop, and the harness closes the loop. [caption truncated in source: "protocol, where harness choice is varied as a controlled factor and va…"]

Original abstract

This position paper argues that, for long-horizon tasks evaluated across models with comparable frontier capability, the agent execution harness, namely the infrastructure layer that governs context construction, tool interaction, orchestration, and verification around a language model, is often a stronger determinant of agent performance than the model it wraps. We formalize and defend the Binding Constraint Thesis: in this regime, performance variance is governed more by harness configuration than by model choice, and current evaluation protocols therefore systematically misattribute harness-level gains to model improvements. We support this thesis along three lines. First, a control-theoretic formalization treats the harness as the controller of a closed-loop dynamical system and the LLM as the stochastic policy it governs, which explains why small harness changes can produce performance shifts that exceed those obtained by substituting one model for another. Second, published benchmarks, industry deployments, and a controlled variance decomposition show that harness-induced variance can substantially exceed model-induced variance, including cases of model ranking reversal. Third, we propose a harness-aware evaluation framework with a disclosure standard and a variance decomposition protocol. Until harness specifications are disclosed, leaderboard comparisons for long-horizon agents should be treated as incomplete and potentially misleading.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This position paper argues that for long-horizon LLM agent tasks evaluated across models with comparable frontier capability, the agent execution harness (infrastructure governing context construction, tool interaction, orchestration, and verification) is often a stronger determinant of performance than the model itself. It formalizes the Binding Constraint Thesis—that performance variance is governed more by harness configuration than model choice, causing current evaluation protocols to misattribute harness-level gains to model improvements—and supports the thesis via a control-theoretic formalization (harness as closed-loop controller, LLM as stochastic policy), references to published benchmarks/industry deployments/variance decompositions showing harness variance exceeding model variance (including ranking reversals), and a proposed harness-aware evaluation framework with disclosure standards and variance decomposition protocol.

Significance. If the Binding Constraint Thesis holds, the result would be significant for LLM agent evaluation practices, indicating that leaderboards are systematically incomplete without harness disclosure and that variance decomposition protocols could yield more reliable model comparisons. The control-theoretic framing, if made rigorous, offers a potentially useful lens for explaining why small harness changes can dominate model substitution effects.

major comments (2)
  1. [control-theoretic formalization (first support line)] The control-theoretic formalization (first line of support in the abstract): the argument is presented at a high level only and invokes standard control models without addressing how they accommodate the discrete, non-stationary state transitions, context truncation, tool failures, and multi-turn memory that characterize LLM agent execution. Absent explicit assumptions on observability and stationarity, this does not establish that harness effects are binding in the claimed regime or explain performance shifts exceeding model substitution.
  2. [empirical support (second support line)] The empirical support via published benchmarks, industry deployments, and controlled variance decomposition (second line of support in the abstract): the manuscript supplies no new data, derivations, or error analysis and does not report specific variance numbers, decomposition results, or the protocol details, making it impossible to assess the claim that harness-induced variance substantially exceeds model-induced variance or produces ranking reversals.
minor comments (2)
  1. [introduction] The definition of 'harness' appears in the abstract but would benefit from an operational definition with concrete components (e.g., context window management, tool-calling loop) in the introduction to ground subsequent claims.
  2. [proposed framework] The proposed harness-aware evaluation framework is outlined at the end of the abstract but lacks a concrete example of the disclosure standard or variance decomposition protocol, which would improve clarity for readers.
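
As an illustration of the kind of concrete example the second minor comment asks for, a disclosure record could be as simple as the following Python sketch. This is not the paper's disclosure standard: the class name `HarnessDisclosure`, its field names, and the example values are all hypothetical, and the fields merely mirror the four harness functions named in the abstract (context construction, tool interaction, orchestration, verification).

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class HarnessDisclosure:
    """Hypothetical disclosure record; field names are illustrative only."""
    model_id: Optional[str] = None
    context_construction: Optional[str] = None  # e.g. truncation or summarization policy
    tool_interaction: Optional[str] = None      # e.g. which tools and how errors are surfaced
    orchestration: Optional[str] = None         # e.g. single loop vs. planner and workers
    verification: Optional[str] = None          # e.g. tests run before a result is accepted


def missing_fields(card: HarnessDisclosure) -> List[str]:
    """Return the names of fields left undisclosed (None or empty string)."""
    return [f.name for f in fields(card) if getattr(card, f.name) in (None, "")]


card = HarnessDisclosure(
    model_id="example-model",
    context_construction="sliding window over the most recent turns",
    tool_interaction="shell and file-edit tools",
)
print(missing_fields(card))  # ['orchestration', 'verification']
```

A leaderboard could reject or flag any submission for which `missing_fields` returns a non-empty list, which makes incomplete harness reporting visible rather than silent.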

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight areas where the position paper's arguments can be strengthened. We address each major comment below, indicating planned revisions where appropriate. As this is a position paper, our responses focus on clarifying the scope and enhancing rigor without introducing new empirical work.

Point-by-point responses
  1. Referee: [control-theoretic formalization (first support line)] The control-theoretic formalization (first line of support in the abstract): the argument is presented at a high level only and invokes standard control models without addressing how they accommodate the discrete, non-stationary state transitions, context truncation, tool failures, and multi-turn memory that characterize LLM agent execution. Absent explicit assumptions on observability and stationarity, this does not establish that harness effects are binding in the claimed regime or explain performance shifts exceeding model substitution.

    Authors: We agree the formalization is presented at a high level. In revision, we will expand the relevant section to specify assumptions on partial observability (the harness observes tool outputs and context state) and non-stationarity (due to context truncation and evolving task state), and explain how discrete transitions, tool failures, and multi-turn memory are accommodated by the harness's closed-loop orchestration. This will more rigorously link the controller-policy framing to performance shifts exceeding model substitution in the long-horizon regime.
    Revision: yes

  2. Referee: [empirical support (second support line)] The empirical support via published benchmarks, industry deployments, and controlled variance decomposition (second line of support in the abstract): the manuscript supplies no new data, derivations, or error analysis and does not report specific variance numbers, decomposition results, or the protocol details, making it impossible to assess the claim that harness-induced variance substantially exceeds model-induced variance or produces ranking reversals.

    Authors: As a position paper, the manuscript synthesizes evidence from existing published benchmarks and deployments rather than presenting new data or derivations. We will revise to extract and report specific variance numbers, decomposition results, and protocol details from the cited sources (including explicit references to ranking reversal cases), enabling readers to evaluate the relative magnitudes of harness vs. model variance.
    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; thesis supported by external benchmarks and controlled experiments

The paper's Binding Constraint Thesis is supported along three lines: a control-theoretic formalization (explanatory analogy, not a fitted prediction), published benchmarks and industry deployments (external references), and a controlled variance decomposition (empirical protocol). No step reduces a claimed result to a self-defined quantity, a fitted input renamed as prediction, or a load-bearing self-citation chain. The derivation remains self-contained against external evidence and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central thesis rests on the domain assumption that the control-theoretic framing applies to LLM agents and on references to existing external benchmarks whose details are not reproduced here.

axioms (1)
  • domain assumption: The harness functions as the controller of a closed-loop dynamical system, with the LLM acting as the stochastic policy.
    Invoked to explain performance variance dominance in the control-theoretic formalization section of the abstract.


A. Example ETCSOVG Disclosure Card

Table 3 gives the full field set that we expect a benchmark submission to disclose. The compact example in Table 4 instantiates these fields for H_3 in our controlled experiment. Table 3: ETC...