Recognition: unknown
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
Pith reviewed 2026-05-10 04:03 UTC · model grok-4.3
The pith
Agent-as-a-Judge models that interact with environments outperform LLM-as-a-Judge baselines in verification tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce AJ-Bench, a benchmark with 155 tasks and 516 annotated trajectories across three domains to systematically assess Agent-as-a-Judge in information acquisition, state verification, and process verification. Experiments demonstrate consistent performance gains over LLM-as-a-Judge baselines while revealing substantial open challenges in agent-based verification.
What carries the argument
AJ-Bench, the benchmark that evaluates agent judges through interaction with environments to acquire verifiable evidence for information acquisition, state verification, and process verification.
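To make that machinery concrete, here is a minimal sketch of what one task/trajectory record in an AJ-Bench-style benchmark might look like. The schema is hypothetical; the released dataset's field names may differ.

```python
from dataclasses import dataclass, field

# Hypothetical record layout for an AJ-Bench-style benchmark entry.
# Field names are illustrative, not the paper's released schema.
@dataclass
class TrajectoryRecord:
    task_id: str                 # one of the 155 tasks
    domain: str                  # "search", "data_systems", or "gui"
    instruction: str             # the task the evaluated agent attempted
    steps: list = field(default_factory=list)  # (action, observation) pairs
    label: bool = False          # annotated ground truth: did the trajectory succeed?

record = TrajectoryRecord(
    task_id="gui-042",
    domain="gui",
    instruction="Move the image to the right side on Slide 2.",
    steps=[("click", "Picture 3 selected"), ("drag", "image at (930, 412)")],
    label=True,
)
```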
If this is right
- Agent judges can obtain evidence through active interaction where passive LLM judges fall short.
- Performance improvements hold consistently across the tested domains of search, data systems, and GUIs.
- Open challenges in agent-based verification indicate areas for future development.
- The benchmark enables standardized assessment of progress in environment-aware judge agents.
Where Pith is reading between the lines
- Broader adoption of agent judges could improve feedback loops in training other AI agents via reinforcement learning.
- Extending the benchmark to new domains would help determine if the gains generalize beyond the three chosen areas.
- Hybrid systems combining agent judges with rules might address some of the remaining challenges.
Load-bearing premise
The 516 annotated trajectories and three domains provide a representative and unbiased measure of general agent-as-a-judge capabilities.
What would settle it
Evaluating agent judges on tasks from domains not covered by the benchmark and observing no performance gains over LLM judges would indicate the improvements are not general.
Original abstract
As reinforcement learning continues to scale the training of large language model-based agents, reliably verifying agent behaviors in complex environments has become increasingly challenging. Existing approaches rely on rule-based verifiers or LLM-as-a-Judge models, which struggle to generalize beyond narrow domains. Agent-as-a-Judge addresses this limitation by actively interacting with environments and tools to acquire verifiable evidence, yet its capabilities remain underexplored. We introduce a benchmark, AJ-Bench, to systematically evaluate Agent-as-a-Judge across three domains (search, data systems, and graphical user interfaces), comprising 155 tasks and 516 annotated trajectories. The benchmark comprehensively assesses judge agents' abilities in information acquisition, state verification, and process verification. Experiments demonstrate consistent performance gains over LLM-as-a-Judge baselines, while also revealing substantial open challenges in agent-based verification. Our data and code are available at https://aj-bench.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AJ-Bench, a benchmark for evaluating Agent-as-a-Judge models that actively interact with environments and tools. It comprises 155 tasks and 516 annotated trajectories across three domains (search, data systems, and GUIs), assessing capabilities in information acquisition, state verification, and process verification. Experiments report consistent performance gains over LLM-as-a-Judge baselines while identifying substantial open challenges in agent-based verification, with data and code released.
Significance. If the tasks prove representative and the evaluations control for key variables, AJ-Bench could meaningfully advance reliable verification as LLM agent training continues to scale, moving beyond passive or rule-based judges. The public release of tasks and trajectories supports reproducibility and follow-on work in environment-aware evaluation.
major comments (3)
- [Abstract and §3] Abstract and §3 (Benchmark Construction): The central claim of consistent gains over LLM-as-a-Judge baselines and identification of general open challenges rests on the 155 tasks/516 trajectories providing an unbiased sample. With coverage limited to search, data systems, and GUIs, the benchmark may under-sample long-horizon state changes or multi-agent coordination; the paper should quantify domain coverage and test for selection bias (e.g., easier evidence acquisition in search vs. GUIs) to support the generalization implied by the title and abstract.
- [§5] §5 (Experiments): Performance gains are presented without reported statistical significance tests, confidence intervals, or explicit controls for prompt length, tool access, or annotation quality. This leaves open whether observed improvements are robust or confounded by experimental setup details, directly affecting the reliability of the 'consistent performance gains' claim.
- [§4] §4 (Trajectory Annotation): The construction and quality assurance of the 516 annotated trajectories are not described in sufficient detail (e.g., inter-annotator agreement, criteria for 'verifiable evidence'). Since the benchmark's value depends on these ground-truth labels, missing methodological specifics undermine the ability to interpret the reported open challenges and gains.
minor comments (2)
- [Abstract] Abstract: The phrase 'substantial open challenges' is used without enumeration; adding one sentence listing the primary challenges (e.g., specific failure modes in state vs. process verification) would improve clarity for readers.
- [Figures/Tables] Figure/Table captions: Ensure all figures comparing agent vs. LLM judges explicitly state the exact metrics, number of runs, and whether results are averaged over the 516 trajectories.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and describe the revisions we will make.
Point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (Benchmark Construction): The central claim of consistent gains over LLM-as-a-Judge baselines and identification of general open challenges rests on the 155 tasks/516 trajectories providing an unbiased sample. With coverage limited to search, data systems, and GUIs, the benchmark may under-sample long-horizon state changes or multi-agent coordination; the paper should quantify domain coverage and test for selection bias (e.g., easier evidence acquisition in search vs. GUIs) to support the generalization implied by the title and abstract.
Authors: We agree the three domains do not cover every scenario, such as long-horizon state changes or multi-agent coordination. These domains were chosen to span distinct interaction modes (information retrieval, structured data operations, and visual action sequences). In revision we will expand §3 with quantitative coverage metrics (task length distributions, state-change counts, tool-use statistics) and an explicit comparison of evidence-acquisition difficulty across domains. We will also add a limitations subsection that discusses selection bias and the scope of claimed generalization. revision: partial
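A rough sketch of the kind of coverage statistics promised for §3, computed over (domain, step count, state-change count) tuples; the trajectory schema and the numbers are illustrative, not taken from the paper.

```python
from collections import Counter
from statistics import mean, median

# Illustrative coverage metrics; the real trajectory schema is assumed.
trajectories = [
    ("search", 6, 0), ("search", 9, 0),
    ("data_systems", 21, 5), ("gui", 14, 3),
]

by_domain = Counter(domain for domain, _, _ in trajectories)
lengths = [steps for _, steps, _ in trajectories]
state_changes = [changes for _, _, changes in trajectories]

print("tasks per domain:", dict(by_domain))
print(f"trajectory length: mean={mean(lengths):.1f}, median={median(lengths)}")
print(f"state changes: mean={mean(state_changes):.1f}, max={max(state_changes)}")
```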
-
Referee: [§5] §5 (Experiments): Performance gains are presented without reported statistical significance tests, confidence intervals, or explicit controls for prompt length, tool access, or annotation quality. This leaves open whether observed improvements are robust or confounded by experimental setup details, directly affecting the reliability of the 'consistent performance gains' claim.
Authors: We accept that statistical tests and controls are required for robust claims. The revised §5 will report paired significance tests and bootstrap confidence intervals on the performance deltas. We will also document prompt-length standardization, confirm identical tool access across conditions, and clarify how annotation quality was maintained during data collection. revision: yes
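One plausible implementation of the promised analysis is a paired bootstrap over per-trajectory correctness; the score vectors below are synthetic placeholders, not the paper's results.

```python
import random

# Paired bootstrap confidence interval on the agent-vs-LLM accuracy delta.
# Correctness vectors are synthetic stand-ins for per-trajectory judge scores.
random.seed(0)
n = 516
agent_correct = [random.random() < 0.78 for _ in range(n)]
llm_correct = [random.random() < 0.70 for _ in range(n)]

observed_delta = (sum(agent_correct) - sum(llm_correct)) / n

boot_deltas = []
for _ in range(10_000):
    resample = random.choices(range(n), k=n)   # resample trajectories, keep pairing
    delta = sum(agent_correct[i] - llm_correct[i] for i in resample) / n
    boot_deltas.append(delta)

boot_deltas.sort()
lo, hi = boot_deltas[249], boot_deltas[9749]   # 95% percentile interval
print(f"delta = {observed_delta:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```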
-
Referee: [§4] §4 (Trajectory Annotation): The construction and quality assurance of the 516 annotated trajectories are not described in sufficient detail (e.g., inter-annotator agreement, criteria for 'verifiable evidence'). Since the benchmark's value depends on these ground-truth labels, missing methodological specifics undermine the ability to interpret the reported open challenges and gains.
Authors: We will expand §4 to provide the missing methodological detail. The revision will specify the annotation protocol, the exact criteria used to label verifiable evidence, the number of annotators, inter-annotator agreement statistics computed on a sampled subset, and the quality-assurance steps (cross-review and disagreement resolution). revision: yes
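Cohen's kappa on binary pass/fail labels is one standard way to report the promised inter-annotator agreement; a self-contained sketch with synthetic labels follows.

```python
# Cohen's kappa for two annotators' binary pass/fail labels.
def cohens_kappa(a, b):
    assert len(a) == len(b)
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n        # raw agreement
    p_a1, p_b1 = sum(a) / n, sum(b) / n                       # per-annotator pass rates
    p_chance = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)          # expected chance agreement
    return (p_observed - p_chance) / (1 - p_chance)

annotator_1 = [1, 1, 0, 1, 0, 1, 1, 0]   # synthetic labels
annotator_2 = [1, 1, 0, 0, 0, 1, 1, 0]
print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")
```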
Circularity Check
Empirical benchmark paper with no derivations or self-referential reductions
Full rationale
This is a benchmark introduction paper whose central claims consist of empirical performance comparisons (agent-as-a-judge vs. LLM-as-a-judge baselines) on a newly released dataset of 155 tasks and 516 trajectories. No equations, fitted parameters, derivations, or ansatzes appear in the provided text or abstract. The evaluation rests on the released tasks/trajectories rather than any quantity that reduces to its own inputs by construction. No self-citations are invoked to justify uniqueness theorems or load-bearing premises. The paper is therefore self-contained as an empirical contribution; the skeptic concern about domain representativeness is a validity question, not a circularity issue.
Axiom & Free-Parameter Ledger: empty. As the circularity check notes, the paper introduces no equations, fitted parameters, or ansatzes.
Reference graph
Works this paper leans on
-
[1]
TRAIL: Trace Reasoning and Agentic Issue Localization. arXiv:2505.08638, 2025.
-
[2]
E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing. arXiv:2512.03109, 2025.
-
[3]
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents. arXiv:2407.18901, 2024.
-
[4]
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use. arXiv:2509.24002, 2025.
-
[5]
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024.
-
[6]
Qwen3 Technical Report. arXiv:2505.09388, 2025.
-
[7]
WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854, 2023.
Appendix excerpts
The remaining scraped entries were fragments of the paper's appendix figures rather than references. The recoverable content:
- Search-domain worked example: web evidence (Box Office Mojo, Wikipedia, The Numbers) verifying that The Beekeeper premiered on January 12, 2024 with consistent theatrical release dates in mainland China and the United States, scored against Rubric-1; Godzilla x Kong: The New Empire and Deadpool & Wolverine appear as further candidates.
- DS-domain trajectory (Figure 10): chronological assembly of dated file entries into a timeline.txt, checked for date extraction, month-to-day normalization, deduplication, relative-path format, and sort order, ending in VERDICT: PASS.
- GUI-domain trajectory: verification that image "Picture 3" on Slide 2 was moved from the left side (x = 364) to the right side (left edge x = 930, right of the slide center at roughly x = 834), confirmed via the live environment's accessibility tree and the recorded drag action.
- Judge prompts (Figures 13, 15, 17): LLM-as-a-Judge and Agent-as-a-Judge templates for the Search, DS, and GUI domains, including the strict one-line output format (Response: [1] or Response: [0]), the mandatory "N/A" rule, the requirement that a Single-Row satisfy ALL conditions of its matched rubric scoring unit, and the VERDICT: PASS/FAIL and [[TRUE]]/[[FALSE]] verdict instructions.
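Those prompt templates pin each judge to a machine-parseable verdict line. Below is a minimal sketch of how such outputs could be parsed, assuming only the three formats quoted above; the regexes are our reconstruction, not released tooling.

```python
import re

# Parse the strict verdict formats quoted from the appendix prompts.
# The patterns are our reconstruction of those formats, nothing more.
def parse_verdict(text: str):
    if m := re.search(r"Response:\s*\[([01])\]", text):    # Search domain
        return m.group(1) == "1"
    if m := re.search(r"VERDICT:\s*(PASS|FAIL)", text):    # DS domain
        return m.group(1) == "PASS"
    if m := re.search(r"\[\[(TRUE|FALSE)\]\]", text):      # GUI domain
        return m.group(1) == "TRUE"
    return None  # unparseable judge output

assert parse_verdict("Response: [1]") is True
assert parse_verdict("... VERDICT: FAIL") is False
assert parse_verdict("analysis...\n[[TRUE]]") is True
```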