Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Asaf Yehudai; Lilach Eden; Michal Shmueli-Scheuer

arxiv: 2605.22608 · v1 · pith:T3AFEL2Anew · submitted 2026-05-21 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-22 06:02 UTC · model grok-4.3

keywords LLM agents · evaluation framework · multi-level analysis · agent behavior · automated feedback · error alignment · task success prediction

The pith

Agentic CLEAR automates dynamic multi-level evaluation of LLM agents with data-driven textual feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Agentic CLEAR as an automatic evaluation framework for autonomous LLM agents that define strategies and interact with environments. It generates textual insights at three levels of granularity—system, trace, and node—without depending on static hand-crafted error taxonomies. Experiments across four benchmarks, seven agentic settings, and tens of thousands of LLM calls show strong alignment with human-annotated errors and the ability to predict task success rates. This addresses the gap in overseeing agent behavior beyond basic observability tools.

Core claim

Agentic CLEAR produces high-quality, data-driven, insightful feedback on agent behavior at system, trace, and node levels. It operates above the observability layer with an intuitive UI and demonstrates strong alignment with human-annotated errors along with the ability to predict task success rate.

What carries the argument

The Agentic CLEAR framework, which uses LLMs to generate dynamic textual insights at three granularity levels while integrating seamlessly above existing observability layers.
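
As a purely illustrative sketch, and not the paper's implementation, the three granularity levels can be pictured as nested records: nodes belong to traces, and traces belong to a system-level report. Every name below (`NodeRecord`, `TraceRecord`, `SystemReport`, `issue_counts`) is hypothetical, and the insight strings are hand-written stand-ins for text an LLM judge might produce.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRecord:
    """Hypothetical node-level unit: one step in a run, e.g. a single LLM or tool call."""
    name: str
    insights: list[str] = field(default_factory=list)


@dataclass
class TraceRecord:
    """Hypothetical trace-level unit: one end-to-end agent run made of ordered nodes."""
    trace_id: str
    nodes: list[NodeRecord]
    insights: list[str] = field(default_factory=list)


@dataclass
class SystemReport:
    """Hypothetical system-level unit: all traces collected for one agent setup."""
    traces: list[TraceRecord]

    def issue_counts(self) -> dict[str, int]:
        """Count the number of traces in which each insight appears at any level."""
        counts: dict[str, int] = {}
        for trace in self.traces:
            seen = set(trace.insights)
            for node in trace.nodes:
                seen.update(node.insights)
            for insight in seen:
                counts[insight] = counts.get(insight, 0) + 1
        return counts


# Toy data; the insight texts are invented for illustration only.
report = SystemReport(traces=[
    TraceRecord("t1", [NodeRecord("plan", ["skipped constraint check"]),
                       NodeRecord("tool_call")]),
    TraceRecord("t2", [NodeRecord("plan"),
                       NodeRecord("tool_call", ["wrong tool argument"])],
                ["gave up early"]),
    TraceRecord("t3", [NodeRecord("plan", ["skipped constraint check"])]),
])
system_view = report.issue_counts()
```

In this toy layout the system view is just a roll-up of recurring trace and node findings. The paper describes system-level insights as generated text, so the counting step here is only one simple way to picture the aggregation.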

If this is right

Enables seamless integration into agent systems via an accessible UI for ongoing oversight.
Provides feedback that correlates with and can predict whether agents succeed on tasks.
Adapts evaluations dynamically to new domains and agent behaviors without manual taxonomy updates.
Supports analysis at multiple scales from overall system performance down to individual nodes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The generated insights could be looped back to refine agent strategies in subsequent runs.
This approach might generalize to evaluating hybrid systems combining LLMs with other tools or planners.
Scalability testing on longer agent traces could highlight where multi-level granularity adds the most value.

Load-bearing premise

LLMs can generate accurate and domain-adaptive textual insights into agent behavior without relying on static hand-crafted error taxonomies or extensive domain-specific customization.

What would settle it

Human annotations of agent errors on a new benchmark showing low agreement with Agentic CLEAR feedback or failure to accurately predict task success rates.
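
One way to make the success-prediction half of this test concrete is an AUC between ground-truth task outcomes and predicted scores. The sketch below is a minimal, assumption-laden illustration: the function name `roc_auc` and the toy numbers are invented here, and how the paper derives its predicted scores is not specified in this summary.

```python
def roc_auc(labels, scores):
    """Rank-based AUC: the chance that a randomly chosen success gets a higher
    score than a randomly chosen failure, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (invented): 1 = task succeeded, 0 = task failed.
succeeded = [1, 1, 0, 1, 0, 0]
# Hypothetical per-trace success scores derived from evaluator feedback.
predicted = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]

auc = roc_auc(succeeded, predicted)
print(f"AUC = {auc:.3f}")
```

An AUC near 0.5 on a new benchmark would indicate that the scores carry little information about task success, which is the kind of outcome that would count against the claim.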

Figures

Figures reproduced from arXiv: 2605.22608 by Asaf Yehudai, Lilach Eden, Michal Shmueli-Scheuer.

**Figure 2.** The interactive UI of Agentic CLEAR, enabling multi-granular evaluation and diagnosis of agentic ...

Original abstract

Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challenges for overseeing and assessing agent behavior. Most current tools are limited, focusing on observability with basic evaluation capabilities or imposing static, hand-crafted error taxonomies that cannot adapt to new domains. To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework. It produces textual insights into the agent behavior on three levels of granularity: system, trace, and node. Agentic CLEAR operates above the observability layer, enabling seamless integration and featuring an intuitive UI that makes agent evaluation highly accessible. In our experiments on four benchmarks, seven agentic settings, and tens of thousands of LLM calls, we show that Agentic CLEAR produces high-quality, data-driven, insightful feedback. Our analysis shows strong alignment with human-annotated errors and the ability to predict task success rate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Agentic CLEAR gives a workable multi-level dynamic eval setup for agents with UI integration, but the headline claims on human alignment rest on missing quantitative details.

The main thing to know is that this paper describes Agentic CLEAR as an automatic framework that generates textual feedback on LLM agent behavior at system, trace, and node levels, sitting on top of existing observability tools and adding an intuitive UI. It moves past static error taxonomies by trying to adapt dynamically across domains without heavy hand-crafting upfront. They tested it on four benchmarks, seven agent settings, and a large number of calls, which shows some effort at scale. That combination of granularity levels plus accessibility features is the clearest practical step forward here, and it could help teams that already have tracing in place but need more structured insights without constant manual work. The approach treats evaluation as something that can run automatically and produce readable output rather than just logs or simple scores, which matches real needs in agent deployment.
On the positive side, the paper engages directly with the stated limitations of current tools and positions its method as filling that gap through LLM-driven analysis at multiple resolutions. The UI mention suggests they thought about usability beyond just the backend logic.
The soft spots sit mostly in the validation. The abstract and summary assert strong alignment with human-annotated errors plus the ability to predict task success, yet no specific metrics appear for how that alignment was measured, no inter-annotator agreement figures, no sample sizes for the human study, and no description of whether annotators saw the framework outputs or how traces were selected. Without those, it is difficult to separate genuine added insight from shared model priors or surface features already visible in the traces. The central assumption that the system can deliver accurate domain-adaptive textual feedback without extensive customization also lacks the controls or error analysis that would make the experimental outcomes convincing.
This is the sort of paper that would interest engineers and researchers working on production agent systems who need better automated oversight to cut down on manual review costs. A reader focused on practical tooling rather than theoretical advances could extract the framework description and UI ideas even if the results section needs tightening. I would send it for peer review so the experimental details, annotation protocol, and any quantitative results can be checked and strengthened rather than desk-rejecting it outright.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Agentic CLEAR, an automatic and dynamic evaluation framework for LLM agents that generates textual insights into agent behavior at three levels of granularity (system, trace, and node). It positions itself above the observability layer with an intuitive UI and claims to avoid limitations of static hand-crafted error taxonomies by being domain-adaptive. Experiments across four benchmarks, seven agentic settings, and tens of thousands of LLM calls are reported to show high-quality data-driven feedback, strong alignment with human-annotated errors, and the ability to predict task success rates.

Significance. If the reported alignment and predictive results hold under rigorous validation, the framework could meaningfully advance automated oversight of autonomous agentic systems by providing scalable, adaptive textual analysis without extensive domain customization. The multi-level granularity and UI accessibility represent practical strengths for integration into existing agent development pipelines.

major comments (2)

[Abstract] Abstract and experimental summary: the central claim of 'strong alignment with human-annotated errors' and high-quality feedback lacks any reported quantitative metrics (e.g., error-type precision/recall, Cohen's kappa, correlation coefficients, or inter-annotator agreement), sample sizes for the human study, or description of the annotation protocol and whether annotators were blinded to Agentic CLEAR outputs. This is load-bearing for the headline result and prevents assessment of whether alignment exceeds what would be expected from shared LLM priors.
[Experiments] Experimental section (implied by abstract): no controls, baselines, or error analysis are described for the success-rate prediction claim. It is unclear whether prediction relies on features already visible in raw traces or genuinely derives from the generated insights, which is necessary to substantiate the framework's added value over simpler observability tools.
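
To illustrate the kind of quantitative reporting the first major comment asks for, the sketch below computes Cohen's kappa and macro-averaged F1 between human error labels and framework-assigned labels. It is a minimal example under stated assumptions: the label set, the toy annotations, and the function names are invented, and the paper's actual annotation protocol is not described in this summary.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    f1_scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy annotations (invented): one error label per trace.
human = ["tool", "plan", "tool", "none", "plan", "none"]
framework = ["tool", "plan", "none", "none", "plan", "tool"]

kappa = cohens_kappa(human, framework)
f1 = macro_f1(human, framework)
print(f"kappa = {kappa:.2f}, macro-F1 = {f1:.2f}")
```

Reporting both figures alongside sample sizes would let a reader judge whether the alignment is substantially above chance, since raw agreement alone can look high when one label dominates.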

minor comments (2)

[Abstract] The abstract mentions 'four benchmarks' and 'seven agentic settings' but does not name them; providing these explicitly would improve reproducibility and allow readers to assess domain coverage.
Notation for the three evaluation levels (system/trace/node) is introduced without a diagram or pseudocode example in the early sections; a small illustrative figure would clarify the granularity distinctions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of Agentic CLEAR and for the constructive comments that help strengthen the presentation of our results. We address each major comment below and commit to revisions that improve transparency without altering the core claims.

Referee: [Abstract] Abstract and experimental summary: the central claim of 'strong alignment with human-annotated errors' and high-quality feedback lacks any reported quantitative metrics (e.g., error-type precision/recall, Cohen's kappa, correlation coefficients, or inter-annotator agreement), sample sizes for the human study, or description of the annotation protocol and whether annotators were blinded to Agentic CLEAR outputs. This is load-bearing for the headline result and prevents assessment of whether alignment exceeds what would be expected from shared LLM priors.

Authors: We agree that the abstract would benefit from explicit quantitative support for the alignment claim. The full manuscript contains a human evaluation study with reported alignment measures; we will revise the abstract to include the key metrics (e.g., observed agreement rates and correlation values) along with sample sizes. We will also expand the experimental section to describe the annotation protocol, blinding procedures, and add a baseline comparison against predictions derived solely from shared LLM priors to demonstrate that the observed alignment exceeds what would be expected from model priors alone.
revision: yes
Referee: [Experiments] Experimental section (implied by abstract): no controls, baselines, or error analysis are described for the success-rate prediction claim. It is unclear whether prediction relies on features already visible in raw traces or genuinely derives from the generated insights, which is necessary to substantiate the framework's added value over simpler observability tools.

Authors: We acknowledge that clearer controls are needed to isolate the contribution of the generated insights. We will revise the experimental section to include explicit baselines that compare success-rate prediction using only raw trace features versus features derived from Agentic CLEAR's multi-level textual insights. We will also add an error analysis section detailing the predictive model and feature importance to show that the insights provide incremental value beyond basic observability.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework validated against external human annotations and success rates

The paper presents Agentic CLEAR as an LLM-based framework generating multi-level textual insights into agent behavior without static hand-crafted taxonomies. Its central claims rest on empirical experiments across four benchmarks and seven settings, reporting alignment with independently human-annotated errors plus correlation to task success rates. These are external benchmarks and annotations, not quantities defined or fitted internally by the framework. No equations, parameter fits, or derivations are described that reduce by construction to the framework's own outputs. No load-bearing self-citations or uniqueness theorems from prior author work are invoked in the abstract or summary. The evaluation chain is self-contained against external measures and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The framework rests on the assumption that LLMs can reliably critique agent behavior across domains and that benchmarks provide valid ground truth for alignment and prediction.

axioms (2)

domain assumption: LLM-based evaluators can produce insights aligned with human judgments without hand-crafted taxonomies
This is the core operating premise stated in the motivation and results sections of the abstract.
domain assumption: Existing tools are limited to observability or static error taxonomies
Directly invoked in the abstract to motivate the new framework.

invented entities (1)

Agentic CLEAR framework · no independent evidence
purpose: To provide automatic, dynamic, multi-level textual evaluation of LLM agents
The main new artifact introduced by the paper.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · unclear
Paper passage: "Agentic CLEAR produces textual insights into the agent behavior on three levels of granularity: system, trace, and node... strong alignment with human-annotated errors and the ability to predict task success rate."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We validate Agentic CLEAR through two complementary analyses... macro-averaged F1... AUC between the ground-truth and the predicted scores."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

