Pith Number

pith:VGH4GKH4

pith:2026:VGH4GKH4GKCI4ODDSOMW6I3V5C
not attested · not anchored · not stored · refs resolved

DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

Chenglin Wu, Sirui Hong, Tengfei Li, Wei Tao, Yifan Wu, Zhijie Liu

Trajectory-conditioned diagnostic probes recover 45-62 percent of failures misattributed to software defects in GUI-agent evaluations.

arxiv:2605.17439 v1 · 2026-05-17 · cs.SE · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{VGH4GKH4GKCI4ODDSOMW6I3V5C}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

On false-negative cases, DiagEval recovers 45.6-62.1% of failures that were initially misattributed to software defects, outperforming retry-based baselines with 34.4-160.6% relative gains. On the full evaluation sets, this recovery improves accuracy from 69.9% to 78.3% on WebDevJudge-Unit and from 65.0% to 81.6% on RealDevBench.

C2 weakest assumption

That targeted diagnostic probes chosen from the failed trajectory can produce an attribution signal that reliably separates evaluator-side execution errors from genuine software defects, without requiring reconstruction of the full latent state-transition graph or calibrated posteriors.

C3 one line summary

DiagEval is a new diagnostic protocol that conditions on failed trajectories to attribute GUI-agent evaluation failures, recovering 45.6-62.1% of misattributed cases and lifting accuracy by 8.4 points on WebDevJudge-Unit and 16.6 points on RealDevBench.
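
The accuracy improvements quoted in claim C1 can be checked with simple arithmetic. The sketch below only restates the four percentages given in C1 and computes the absolute gain in percentage points; the names `reported` and `point_gain` are illustrative and not part of the record.

```python
# Accuracy before and after DiagEval, in percent, as quoted in claim C1.
reported = {
    "WebDevJudge-Unit": (69.9, 78.3),
    "RealDevBench": (65.0, 81.6),
}


def point_gain(before: float, after: float) -> float:
    """Absolute accuracy gain in percentage points, rounded to one decimal."""
    return round(after - before, 1)


gains = {name: point_gain(*pair) for name, pair in reported.items()}
for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

These are absolute gains on the full evaluation sets. They are a different quantity from the 34.4-160.6% relative gains over retry-based baselines, which C1 reports for the false-negative cases only.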

References

61 extracted · 61 resolved · 9 Pith anchors


Formal links

2 machine-checked theorem links

Receipt and verification
First computed: 2026-05-20T00:04:38.914858Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

a98fc328fc32848e386393996f2375e880d106ea1d335d67386858fdef045956

Aliases

arxiv: 2605.17439 · arxiv_version: 2605.17439v1 · doi: 10.48550/arxiv.2605.17439 · pith_short_12: VGH4GKH4GKCI · pith_short_16: VGH4GKH4GKCI4ODD · pith_short_8: VGH4GKH4
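
The short aliases above are prefixes of the full identifier. For this record, the identifier also coincides with the first 26 characters of the Base32 (RFC 4648) encoding of the canonical hash. That relationship is an observation from the values shown on this page, not a documented rule of the scheme, so the sketch below should be read as a consistency check under that assumption.

```python
import base64

# Values copied from this page.
canonical_hash = "a98fc328fc32848e386393996f2375e880d106ea1d335d67386858fdef045956"
pith_number = "VGH4GKH4GKCI4ODDSOMW6I3V5C"

# Base32-encode the raw digest bytes, then truncate to the identifier length.
# Assumption: truncated Base32 of the hash is how the identifier is formed.
encoded = base64.b32encode(bytes.fromhex(canonical_hash)).decode("ascii")
derived = encoded[: len(pith_number)]

# The pith_short_N aliases are the first N characters of the identifier.
aliases = {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}

print("identifier matches hash prefix:", derived == pith_number)
for name, value in aliases.items():
    print(name, value)
```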
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/VGH4GKH4GKCI4ODDSOMW6I3V5C \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: a98fc328fc32848e386393996f2375e880d106ea1d335d67386858fdef045956
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "7b2f0dbb00e3b0e9acb1cb12f2f4f0bda3919442953ac9bb3c084243f8d2a51b",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.SE",
    "submitted_at": "2026-05-17T13:22:22Z",
    "title_canon_sha256": "d9eff041f6321d1b73436f9d7179e1a1f664030a60a60027bd085455130cfb58"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.17439",
    "kind": "arxiv",
    "version": 1
  }
}
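
The verification one-liner can also be run offline against the record shown above. The following is a minimal sketch that applies the same canonicalization as the command (sorted keys, compact separators, no ASCII escaping) and compares the digest with the canonical hash listed on this page. The helper name `canonical_bytes` is illustrative, and the comparison result is printed rather than assumed.

```python
import hashlib
import json

# Canonical record, copied from the JSON above.
record = {
    "metadata": {
        "abstract_canon_sha256": "7b2f0dbb00e3b0e9acb1cb12f2f4f0bda3919442953ac9bb3c084243f8d2a51b",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.SE",
        "submitted_at": "2026-05-17T13:22:22Z",
        "title_canon_sha256": "d9eff041f6321d1b73436f9d7179e1a1f664030a60a60027bd085455130cfb58",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.17439", "kind": "arxiv", "version": 1},
}


def canonical_bytes(obj) -> bytes:
    """Serialize with sorted keys and compact separators, as in the one-liner."""
    text = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


digest = hashlib.sha256(canonical_bytes(record)).hexdigest()
expected = "a98fc328fc32848e386393996f2375e880d106ea1d335d67386858fdef045956"

print(digest)
print("matches canonical hash on page:", digest == expected)
```

Because keys are sorted before hashing, the order in which a mirror stores or returns the fields does not affect the digest.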