Recognition: unknown
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
Pith reviewed 2026-05-10 04:03 UTC · model grok-4.3
The pith
Agent-as-a-Judge models that interact with environments outperform LLM-as-a-Judge baselines in verification tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce AJ-Bench, a benchmark with 155 tasks and 516 annotated trajectories across three domains to systematically assess Agent-as-a-Judge in information acquisition, state verification, and process verification. Experiments demonstrate consistent performance gains over LLM-as-a-Judge baselines while revealing substantial open challenges in agent-based verification.
What carries the argument
AJ-Bench, the benchmark that evaluates agent judges through interaction with environments to acquire verifiable evidence for information acquisition, state verification, and process verification.
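To make that machinery concrete, here is a minimal sketch of what one task/trajectory record in an AJ-Bench-style benchmark might look like. The schema is hypothetical; the released dataset's field names may differ.

```python
from dataclasses import dataclass, field

# Hypothetical record layout for an AJ-Bench-style benchmark entry.
# Field names are illustrative, not the paper's released schema.
@dataclass
class TrajectoryRecord:
    task_id: str                 # one of the 155 tasks
    domain: str                  # "search", "data_systems", or "gui"
    instruction: str             # the task the evaluated agent attempted
    steps: list = field(default_factory=list)  # (action, observation) pairs
    label: bool = False          # annotated ground truth: did the trajectory succeed?

record = TrajectoryRecord(
    task_id="gui-042",
    domain="gui",
    instruction="Move the image to the right side on Slide 2.",
    steps=[("click", "Picture 3 selected"), ("drag", "image at (930, 412)")],
    label=True,
)
```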
If this is right
- Agent judges can obtain evidence through active interaction where passive LLM judges fall short.
- Performance improvements hold consistently across the tested domains of search, data systems, and GUIs.
- Open challenges in agent-based verification indicate areas for future development.
- The benchmark enables standardized assessment of progress in environment-aware judge agents.
Where Pith is reading between the lines
- Broader adoption of agent judges could improve feedback loops in training other AI agents via reinforcement learning.
- Extending the benchmark to new domains would help determine if the gains generalize beyond the three chosen areas.
- Hybrid systems combining agent judges with rules might address some of the remaining challenges.
Load-bearing premise
The 516 annotated trajectories and three domains provide a representative and unbiased measure of general agent-as-a-judge capabilities.
What would settle it
Evaluating agent judges on tasks from domains not covered by the benchmark and observing no performance gains over LLM judges would indicate the improvements are not general.
Original abstract
As reinforcement learning continues to scale the training of large language model-based agents, reliably verifying agent behaviors in complex environments has become increasingly challenging. Existing approaches rely on rule-based verifiers or LLM-as-a-Judge models, which struggle to generalize beyond narrow domains. Agent-as-a-Judge addresses this limitation by actively interacting with environments and tools to acquire verifiable evidence, yet its capabilities remain underexplored. We introduce a benchmark, AJ-Bench, to systematically evaluate Agent-as-a-Judge across three domains (search, data systems, and graphical user interfaces), comprising 155 tasks and 516 annotated trajectories. The benchmark comprehensively assesses judge agents' abilities in information acquisition, state verification, and process verification. Experiments demonstrate consistent performance gains over LLM-as-a-Judge baselines, while also revealing substantial open challenges in agent-based verification. Our data and code are available at https://aj-bench.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AJ-Bench, a benchmark for evaluating Agent-as-a-Judge models that actively interact with environments and tools. It comprises 155 tasks and 516 annotated trajectories across three domains (search, data systems, and GUIs), assessing capabilities in information acquisition, state verification, and process verification. Experiments report consistent performance gains over LLM-as-a-Judge baselines while identifying substantial open challenges in agent-based verification, with data and code released.
Significance. If the tasks prove representative and the evaluations control for key variables, AJ-Bench could meaningfully advance reliable verification as LLM agent training continues to scale, moving beyond passive or rule-based judges. The public release of tasks and trajectories supports reproducibility and follow-on work in environment-aware evaluation.
major comments (3)
- [Abstract and §3] Abstract and §3 (Benchmark Construction): The central claim of consistent gains over LLM-as-a-Judge baselines and identification of general open challenges rests on the 155 tasks/516 trajectories providing an unbiased sample. With coverage limited to search, data systems, and GUIs, the benchmark may under-sample long-horizon state changes or multi-agent coordination; the paper should quantify domain coverage and test for selection bias (e.g., easier evidence acquisition in search vs. GUIs) to support the generalization implied by the title and abstract.
- [§5] §5 (Experiments): Performance gains are presented without reported statistical significance tests, confidence intervals, or explicit controls for prompt length, tool access, or annotation quality. This leaves open whether observed improvements are robust or confounded by experimental setup details, directly affecting the reliability of the 'consistent performance gains' claim.
- [§4] §4 (Trajectory Annotation): The construction and quality assurance of the 516 annotated trajectories are not described in sufficient detail (e.g., inter-annotator agreement, criteria for 'verifiable evidence'). Since the benchmark's value depends on these ground-truth labels, missing methodological specifics undermine the ability to interpret the reported open challenges and gains.
minor comments (2)
- [Abstract] Abstract: The phrase 'substantial open challenges' is used without enumeration; adding one sentence listing the primary challenges (e.g., specific failure modes in state vs. process verification) would improve clarity for readers.
- [Figures/Tables] Figure/Table captions: Ensure all figures comparing agent vs. LLM judges explicitly state the exact metrics, number of runs, and whether results are averaged over the 516 trajectories.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and describe the revisions we will make.
Point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (Benchmark Construction): The central claim of consistent gains over LLM-as-a-Judge baselines and identification of general open challenges rests on the 155 tasks/516 trajectories providing an unbiased sample. With coverage limited to search, data systems, and GUIs, the benchmark may under-sample long-horizon state changes or multi-agent coordination; the paper should quantify domain coverage and test for selection bias (e.g., easier evidence acquisition in search vs. GUIs) to support the generalization implied by the title and abstract.
Authors: We agree the three domains do not cover every scenario, such as long-horizon state changes or multi-agent coordination. These domains were chosen to span distinct interaction modes (information retrieval, structured data operations, and visual action sequences). In revision we will expand §3 with quantitative coverage metrics (task length distributions, state-change counts, tool-use statistics) and an explicit comparison of evidence-acquisition difficulty across domains. We will also add a limitations subsection that discusses selection bias and the scope of claimed generalization. revision: partial
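A rough sketch of the kind of coverage statistics promised for §3, computed over (domain, step count, state-change count) tuples; the trajectory schema and the numbers are illustrative, not taken from the paper.

```python
from collections import Counter
from statistics import mean, median

# Illustrative coverage metrics; the real trajectory schema is assumed.
trajectories = [
    ("search", 6, 0), ("search", 9, 0),
    ("data_systems", 21, 5), ("gui", 14, 3),
]

by_domain = Counter(domain for domain, _, _ in trajectories)
lengths = [steps for _, steps, _ in trajectories]
state_changes = [changes for _, _, changes in trajectories]

print("tasks per domain:", dict(by_domain))
print(f"trajectory length: mean={mean(lengths):.1f}, median={median(lengths)}")
print(f"state changes: mean={mean(state_changes):.1f}, max={max(state_changes)}")
```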
-
Referee: [§5] §5 (Experiments): Performance gains are presented without reported statistical significance tests, confidence intervals, or explicit controls for prompt length, tool access, or annotation quality. This leaves open whether observed improvements are robust or confounded by experimental setup details, directly affecting the reliability of the 'consistent performance gains' claim.
Authors: We accept that statistical tests and controls are required for robust claims. The revised §5 will report paired significance tests and bootstrap confidence intervals on the performance deltas. We will also document prompt-length standardization, confirm identical tool access across conditions, and clarify how annotation quality was maintained during data collection. revision: yes
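One plausible implementation of the promised analysis is a paired bootstrap over per-trajectory correctness; the score vectors below are synthetic placeholders, not the paper's results.

```python
import random

# Paired bootstrap confidence interval on the agent-vs-LLM accuracy delta.
# Correctness vectors are synthetic stand-ins for per-trajectory judge scores.
random.seed(0)
n = 516
agent_correct = [random.random() < 0.78 for _ in range(n)]
llm_correct = [random.random() < 0.70 for _ in range(n)]

observed_delta = (sum(agent_correct) - sum(llm_correct)) / n

boot_deltas = []
for _ in range(10_000):
    resample = random.choices(range(n), k=n)   # resample trajectories, keep pairing
    delta = sum(agent_correct[i] - llm_correct[i] for i in resample) / n
    boot_deltas.append(delta)

boot_deltas.sort()
lo, hi = boot_deltas[249], boot_deltas[9749]   # 95% percentile interval
print(f"delta = {observed_delta:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```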
-
Referee: [§4] §4 (Trajectory Annotation): The construction and quality assurance of the 516 annotated trajectories are not described in sufficient detail (e.g., inter-annotator agreement, criteria for 'verifiable evidence'). Since the benchmark's value depends on these ground-truth labels, missing methodological specifics undermine the ability to interpret the reported open challenges and gains.
Authors: We will expand §4 to provide the missing methodological detail. The revision will specify the annotation protocol, the exact criteria used to label verifiable evidence, the number of annotators, inter-annotator agreement statistics computed on a sampled subset, and the quality-assurance steps (cross-review and disagreement resolution). revision: yes
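Cohen's kappa on binary pass/fail labels is one standard way to report the promised inter-annotator agreement; a self-contained sketch with synthetic labels follows.

```python
# Cohen's kappa for two annotators' binary pass/fail labels.
def cohens_kappa(a, b):
    assert len(a) == len(b)
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n        # raw agreement
    p_a1, p_b1 = sum(a) / n, sum(b) / n                       # per-annotator pass rates
    p_chance = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)          # expected chance agreement
    return (p_observed - p_chance) / (1 - p_chance)

annotator_1 = [1, 1, 0, 1, 0, 1, 1, 0]   # synthetic labels
annotator_2 = [1, 1, 0, 0, 0, 1, 1, 0]
print(f"kappa = {cohens_kappa(annotator_1, annotator_2):.2f}")
```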
Circularity Check
Empirical benchmark paper with no derivations or self-referential reductions
Full rationale
This is a benchmark introduction paper whose central claims consist of empirical performance comparisons (agent-as-a-judge vs. LLM-as-a-judge baselines) on a newly released dataset of 155 tasks and 516 trajectories. No equations, fitted parameters, derivations, or ansatzes appear in the provided text or abstract. The evaluation rests on the released tasks/trajectories rather than any quantity that reduces to its own inputs by construction. No self-citations are invoked to justify uniqueness theorems or load-bearing premises. The paper is therefore self-contained as an empirical contribution; the skeptic concern about domain representativeness is a validity question, not a circularity issue.
Axiom & Free-Parameter Ledger: empty. As the circularity check notes, the paper introduces no equations, fitted parameters, or ansatzes.
Reference graph
Works this paper leans on
-
[1]
TRAIL: Trace Reasoning and Agentic Issue Localization. arXiv:2505.08638, 2025.
-
[2]
E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing. arXiv:2512.03109, 2025.
-
[3]
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents. arXiv:2407.18901, 2024.
-
[4]
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use. arXiv:2509.24002, 2025.
-
[5]
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024.
-
[6]
Qwen3 Technical Report. arXiv:2505.09388, 2025.
-
[7]
WebArena: A Realistic Web Environment for Building Autonomous Agents. arXiv:2307.13854, 2023.
Appendix excerpts
The remaining scraped entries were fragments of the paper's appendix figures rather than references. The recoverable content:
- Search-domain worked example: web evidence (Box Office Mojo, Wikipedia, The Numbers) verifying that The Beekeeper premiered on January 12, 2024 with consistent theatrical release dates in mainland China and the United States, scored against Rubric-1; Godzilla x Kong: The New Empire and Deadpool & Wolverine appear as further candidates.
- DS-domain trajectory (Figure 10): chronological assembly of dated file entries into a timeline.txt, checked for date extraction, month-to-day normalization, deduplication, relative-path format, and sort order, ending in VERDICT: PASS.
- GUI-domain trajectory: verification that image "Picture 3" on Slide 2 was moved from the left side (x = 364) to the right side (left edge x = 930, right of the slide center at roughly x = 834), confirmed via the live environment's accessibility tree and the recorded drag action.
- Judge prompts (Figures 13, 15, 17): LLM-as-a-Judge and Agent-as-a-Judge templates for the Search, DS, and GUI domains, including the strict one-line output format (Response: [1] or Response: [0]), the mandatory "N/A" rule, the requirement that a Single-Row satisfy ALL conditions of its matched rubric scoring unit, and the VERDICT: PASS/FAIL and [[TRUE]]/[[FALSE]] verdict instructions.
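Those prompt templates pin each judge to a machine-parseable verdict line. Below is a minimal sketch of how such outputs could be parsed, assuming only the three formats quoted above; the regexes are our reconstruction, not released tooling.

```python
import re

# Parse the strict verdict formats quoted from the appendix prompts.
# The patterns are our reconstruction of those formats, nothing more.
def parse_verdict(text: str):
    if m := re.search(r"Response:\s*\[([01])\]", text):    # Search domain
        return m.group(1) == "1"
    if m := re.search(r"VERDICT:\s*(PASS|FAIL)", text):    # DS domain
        return m.group(1) == "PASS"
    if m := re.search(r"\[\[(TRUE|FALSE)\]\]", text):      # GUI domain
        return m.group(1) == "TRUE"
    return None  # unparseable judge output

assert parse_verdict("Response: [1]") is True
assert parse_verdict("... VERDICT: FAIL") is False
assert parse_verdict("analysis...\n[[TRUE]]") is True
```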