pith. machine review for the scientific record.

arxiv: 2602.22953 · v2 · submitted 2026-02-26 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

General Agent Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:11 UTC · model grok-4.3

classification 💻 cs.AI
keywords general agents · agent evaluation · LLM agents · benchmarks · unifying protocol · tool calling · code generation

The pith

General-purpose agents adapt to every tested domain without customization and match leading specialized agents on four of six benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs the first head-to-head evaluation of multiple general agent architectures against the same set of backbone models on six heterogeneous benchmarks. It supplies a unifying protocol and harness so that any agent can be plugged into any benchmark without per-domain wiring or human-authored glue. The central result is that these general agents handle unfamiliar environments in software engineering, customer service, research, and assistance tasks with no per-domain customization. Architecture choice shifts scores by up to twelve percentage points within one model, yet the backbone model still explains most of the performance gap. On four of the six benchmarks the strongest general configurations are statistically indistinguishable from the best heavily customized domain-specific agents.
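To see what "architecture swings scores, backbone dominates" means operationally, here is a minimal sketch of how both quantities could be read off a leaderboard score matrix. The numbers are placeholders for illustration, not the paper's results.

```python
import numpy as np

# Hypothetical leaderboard slice (placeholder numbers, NOT the paper's data):
# rows = backbone models, columns = agent architectures, values = benchmark scores.
scores = np.array([
    [70.0, 66.0, 62.0, 61.0, 58.0],   # closed frontier model A
    [63.0, 60.0, 59.0, 57.0, 55.0],   # closed frontier model B
    [35.0, 33.0, 31.0, 29.0, 12.0],   # open-weight model with a "generality sink"
])

# Within-model swing from architecture choice alone (the paper reports up to 12pp).
swing = scores.max(axis=1) - scores.min(axis=1)
print("architecture swing per model (pp):", swing)

# Rough factor comparison: spread of backbone means vs spread of architecture means.
model_means = scores.mean(axis=1)
arch_means = scores.mean(axis=0)
print("backbone spread (pp):", model_means.max() - model_means.min())
print("architecture spread (pp):", arch_means.max() - arch_means.min())
```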

Core claim

We introduce a unifying protocol and evaluation harness that let any general-purpose agent run on any benchmark. With this setup we test five agent architectures and five backbone LLMs across six benchmarks. General agents adapt to every domain without per-domain customization. On four of the six benchmarks the top general agents are indistinguishable from the leading domain-specific agents. Agent architecture moves results by up to twelve percentage points for a fixed model, yet backbone choice dominates. Open-weight models exhibit generality sinks on particular architectures or benchmarks while frontier closed models do not. Behavioral traces reveal architecture-distinctive error signatures that summary scoring cannot discriminate.

What carries the argument

The unifying protocol and evaluation harness that surface any benchmark to any general-purpose agent, enabling a full factorial comparison of five architectures, five backbone models, and six benchmarks.
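As a concrete illustration of the kind of interface such a protocol implies, here is a minimal sketch. The `Task`, `Agent`, `Benchmark`, and `run_factorial` names below are hypothetical stand-ins, not the authors' published API; the point is only that one neutral task shape lets any agent meet any benchmark.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Task:
    """One benchmark task in a protocol-neutral form."""
    task_id: str
    instructions: str
    observation: str

class Agent(Protocol):
    """Any general-purpose agent: tool-calling, MCP, code-generation, or CLI."""
    name: str
    def solve(self, task: Task) -> str: ...

class Benchmark(Protocol):
    """Any benchmark surfaced through the same neutral interface."""
    name: str
    def tasks(self) -> list[Task]: ...
    def score(self, task: Task, answer: str) -> float: ...

def run_factorial(agents: list[Agent], benchmarks: list[Benchmark]) -> dict[tuple[str, str], float]:
    """Run every agent on every benchmark with no per-domain wiring;
    returns mean scores keyed by (agent name, benchmark name)."""
    results: dict[tuple[str, str], float] = {}
    for agent in agents:
        for bench in benchmarks:
            scores = [bench.score(t, agent.solve(t)) for t in bench.tasks()]
            results[(agent.name, bench.name)] = sum(scores) / len(scores)
    return results
```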

Load-bearing premise

The six chosen benchmarks and the new unifying protocol fairly represent general-agent behavior in arbitrary unfamiliar environments without bias from the harness itself.

What would settle it

A new benchmark in an entirely novel domain where even the strongest general-agent configurations fall substantially below the leading domain-specific agents would falsify the adaptation claim.

read the original abstract

General-purpose agents perform tasks in unfamiliar environments without domain-specific manual customization. Yet no study has systematically measured how agent architecture shapes performance across heterogeneous protocols and diverse unfamiliar environments. This is the first systematic study, comparing tool-calling, MCP, code-generation, and CLI agents on the same benchmarks with the same models. Two gaps blocked such a study: existing harnesses require per-benchmark wiring or fixed protocol classes (web for BrowserGym, CLI for Harbor), and benchmarks themselves expect human-authored prompts, context, and integration glue. To enable this study, we contribute (1) a unifying protocol that bridges existing benchmark and agent protocols; (2) an evaluation harness that surfaces any benchmark to any general-purpose agent and backbone model; and (3) the first Open General Agent Leaderboard of agent configurations, a full factorial over 5 agent architectures x 5 backbone LLMs (three closed-source, two open-weight) x 6 benchmarks spanning software engineering, customer service, deep research, and personal assistance. We find that (i) general agents adapt to every tested domain without per-domain customization; (ii) agent architecture choice swings results by up to 12pp within a single model, yet backbone model choice dominates overall performance; (iii) on 4 of 6 tested benchmarks, top general agents are indistinguishable from the leading heavily-customized domain-specific agents; (iv) open-weight models tested exhibit "generality sinks" absent from frontier closed-source models: they consistently collapse on specific agent architectures or benchmarks; (v) a behavioral failure analysis reveals architecture-distinctive error signatures that aggregate scoring cannot discriminate. Code, harness, leaderboard, and traces are at https://www.exgentic.ai.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces a unifying protocol and evaluation harness to enable systematic comparison of general-purpose agent architectures (tool-calling, MCP, code-generation, CLI) across six heterogeneous benchmarks spanning software engineering, customer service, deep research, and personal assistance, using a full factorial design over five agent architectures and five backbone LLMs. It claims that general agents adapt to every tested domain without per-domain customization, that agent architecture swings performance by up to 12pp within a model while backbone choice dominates, that top general agents are statistically indistinguishable from leading domain-specific agents on 4 of 6 benchmarks, that open-weight models exhibit generality sinks, and that behavioral failure analysis reveals architecture-distinctive error signatures. The work releases the harness, leaderboard, code, and traces publicly.

Significance. If the unifying protocol is shown to be neutral, the study supplies the first large-scale empirical evidence on the relative contributions of architecture versus backbone model to general-agent performance and demonstrates that general agents can match heavily customized domain-specific systems on multiple benchmarks. The public harness and full-factorial leaderboard constitute a reusable infrastructure contribution that directly addresses the prior fragmentation of agent evaluation protocols.

major comments (2)
  1. [Abstract] Abstract and Methods: The central claim that general agents match domain-specific agents on 4/6 benchmarks without per-domain customization rests on the assumption that the contributed unifying protocol introduces no systematic bias favoring particular architectures (e.g., tool-calling or code-generation over CLI). No ablation studies, cross-protocol validation, or comparison against prior per-benchmark harnesses are described to substantiate neutrality; this is load-bearing for the indistinguishability and 12pp architecture-swing results.
  2. [Results] Results: The statements that top general agents are 'indistinguishable' from domain-specific leaders and that architecture choice produces a 12pp swing require supporting data tables, confidence intervals, and statistical tests. These are referenced only at a high level; without them the comparative findings cannot be verified at the level needed for the paper's conclusions.
minor comments (1)
  1. [Abstract] The term 'generality sinks' for open-weight models is introduced without an explicit operational definition or quantitative threshold in the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of rigor that we address below. We have revised the manuscript to incorporate additional validation, data presentation, and statistical support as outlined in our point-by-point responses.

read point-by-point responses
  1. Referee: [Abstract] Abstract and Methods: The central claim that general agents match domain-specific agents on 4/6 benchmarks without per-domain customization rests on the assumption that the contributed unifying protocol introduces no systematic bias favoring particular architectures (e.g., tool-calling or code-generation over CLI). No ablation studies, cross-protocol validation, or comparison against prior per-benchmark harnesses are described to substantiate neutrality; this is load-bearing for the indistinguishability and 12pp architecture-swing results.

    Authors: We agree that empirical validation of protocol neutrality is essential to support the core claims. The unifying protocol was intentionally designed as a minimal, architecture-agnostic translation layer that preserves original benchmark task specifications, prompts, and evaluation metrics without injecting architecture-specific context or modifications. To substantiate this, the revised manuscript will add a dedicated subsection in Methods describing the protocol's design principles and implementation details. We will also include a new ablation study comparing performance of the same agent configurations under the unifying protocol versus the original per-benchmark harnesses on a feasible subset of benchmarks (where direct replication is possible). This will provide direct evidence that no systematic bias favoring particular architectures is introduced. revision: yes

  2. Referee: [Results] Results: The statements that top general agents are 'indistinguishable' from domain-specific leaders and that architecture choice produces a 12pp swing require supporting data tables, confidence intervals, and statistical tests. These are referenced only at a high level; without them the comparative findings cannot be verified at the level needed for the paper's conclusions.

    Authors: We accept this critique and will substantially expand the Results section. The revised version will include complete data tables reporting performance for every agent architecture × backbone LLM combination across all six benchmarks. Each score will be accompanied by 95% confidence intervals computed via bootstrapping over multiple evaluation runs. We will add pairwise statistical comparisons (using McNemar's test for success/failure outcomes and appropriate parametric or non-parametric tests for continuous metrics) to formally support claims of statistical indistinguishability (e.g., p > 0.05) between top general agents and domain-specific leaders, as well as to quantify the maximum 12pp architecture-induced swing with error bars and significance. These tables and tests will be referenced explicitly in the text; a minimal sketch of both tests follows these responses. revision: yes
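To make the promised statistics concrete, here is a minimal sketch under assumed data shapes: per-task binary outcomes for two agents on the same tasks, an exact McNemar test implemented as a binomial test on the discordant pairs, and a bootstrap 95% confidence interval. All arrays are placeholders, not the paper's data.

```python
import numpy as np
from scipy.stats import binomtest

rng = np.random.default_rng(0)

# Placeholder per-task success/failure vectors for two agents on the same 200 tasks.
general = rng.random(200) < 0.62      # hypothetical general agent
specialist = rng.random(200) < 0.65   # hypothetical domain-specific agent

# Exact McNemar test: a binomial test on the discordant pairs only.
b = int(np.sum(general & ~specialist))   # general succeeds, specialist fails
c = int(np.sum(~general & specialist))   # specialist succeeds, general fails
p = binomtest(b, b + c, 0.5).pvalue
print(f"McNemar exact p = {p:.3f}  (p > 0.05 -> indistinguishable at this sample size)")

# Bootstrap 95% CI on the general agent's success rate.
boots = [rng.choice(general, size=general.size, replace=True).mean() for _ in range(10_000)]
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"general agent: {general.mean():.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```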

Circularity Check

0 steps flagged

No circularity: empirical measurements on public benchmarks

full rationale

The paper is a purely empirical study that contributes a unifying protocol and harness, then reports direct performance measurements across 5 architectures, 5 models, and 6 benchmarks. No derivation chain, equations, fitted parameters, or first-principles predictions exist that could reduce to the authors' own inputs by construction. All claims (i)–(v) rest on observed experimental outcomes rather than self-referential definitions or load-bearing self-citations. The evaluation is therefore self-contained against external public benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the unifying protocol preserves benchmark semantics and that the six benchmarks adequately sample unfamiliar environments; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption: The unifying protocol does not alter task difficulty or introduce new failure modes compared with native benchmark interfaces.
    Invoked when claiming that general agents perform comparably to domain-specific ones; one way to test this assumption empirically is sketched below.
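A minimal sketch of how this axiom could be checked, assuming nothing about the authors' actual tooling: run the same agent configuration on the same tasks through the native harness and the unifying protocol, then put a confidence interval on the paired per-task difference. The function name and score arrays below are hypothetical and illustrative, not measured.

```python
import numpy as np

def harness_parity_ci(native: np.ndarray, unified: np.ndarray, n_boot: int = 10_000):
    """Bootstrap 95% CI on the mean per-task score difference between the
    unifying protocol and the native benchmark harness. A CI that excludes 0
    would be evidence AGAINST protocol neutrality."""
    rng = np.random.default_rng(0)
    delta = unified - native                            # paired per-task differences
    idx = rng.integers(0, delta.size, size=(n_boot, delta.size))
    boot_means = delta[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return delta.mean(), (lo, hi)

# Placeholder scores, NOT measured data: same agent, same tasks, two harnesses.
native = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1], dtype=float)
unified = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1], dtype=float)
print(harness_parity_ci(native, unified))
```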

pith-pipeline@v0.9.0 · 5678 in / 1185 out tokens · 19091 ms · 2026-05-15T19:11:59.519451+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Learning and Reusing Policy Decompositions for Hierarchical Generalized Planning with LLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    HCL-GP learns parameterized policies and reuses extracted components to achieve 98% accuracy on AppWorld benchmark tasks for LLM agents, outperforming static synthesis by 15.8 points on challenges.

  2. What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis

    cs.AI 2026-05 unverdicted novelty 6.0

    Circuit analysis reveals that routing circuits for agent memory emerge at 0.6B parameters while content circuits emerge at 4B, with a shared grounding hub and an unsupervised diagnostic achieving 76.2% accuracy for lo...

  3. What Happens Inside Agent Memory? Circuit Analysis from Emergence to Diagnosis

    cs.AI 2026-05 unverdicted novelty 6.0

    In LLM agents, memory routing circuits emerge at 0.6B scale while content circuits appear only at 4B, and write/read operations recruit a pre-existing late-layer context hub instead of creating a new one, enabling a 7...

  4. Beyond Task Success: An Evidence-Synthesis Framework for Evaluating, Governing, and Orchestrating Agentic AI

    cs.SE 2026-04 unverdicted novelty 5.0

    Agentic AI evaluation and governance lack mechanisms to bind obligations to actions and prove compliance at runtime; a new synthesis framework with ODTA criteria and action-evidence bundles addresses this closure gap.