{"paper":{"title":"ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation","license":"http://creativecommons.org/licenses/by/4.0/","headline":"ProEval uses pre-trained Gaussian Processes as surrogates to estimate generative AI performance accurately with 8-65 times fewer samples while finding more failures.","cross_cats":["cs.AI","stat.ML"],"primary_cat":"cs.LG","authors_text":"Aditi Kumaresan, Wenjun Zeng, Yizheng Huang, Zi Wang","submitted_at":"2026-04-25T01:33:57Z","abstract_excerpt":"Evaluating generative AI models is increasingly resource-intensive due to slow inference, expensive raters, and a rapidly growing landscape of models and benchmarks. We propose ProEval, a proactive evaluation framework that leverages transfer learning to efficiently estimate performance and identify failure cases. ProEval employs pre-trained Gaussian Processes (GPs) as surrogates for the performance score function, mapping model inputs to metrics such as the severity of errors or safety violations. By framing performance estimation as Bayesian quadrature (BQ) and failure discovery as superleve"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That pre-trained Gaussian Processes trained on prior model evaluations can accurately serve as surrogates for the performance score function on new models and inputs, enabling effective transfer and active selection without significant distribution shift.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"ProEval is a proactive framework using pre-trained GPs, Bayesian quadrature, and superlevel set sampling to estimate performance and find failures in generative AI with 8-65x fewer samples than baselines.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"ProEval uses pre-trained Gaussian Processes as surrogates to estimate generative AI performance accurately with 8-65 times fewer samples while finding more failures.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"1a7cf2a0832f8f6697d04450b8525b7142a95ab74728ece60a5dffd9851edeeb"},"source":{"id":"2604.23099","kind":"arxiv","version":2},"verdict":{"id":"69ade725-b7bb-4a89-9230-3732abd4c125","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-08T08:17:42.260967Z","strongest_claim":"Theoretically, we prove that our pre-trained GP-based BQ estimator is unbiased and bounded. Empirically, extensive experiments on reasoning, safety alignment, and classification benchmarks demonstrate that ProEval is significantly more efficient than competitive baselines. It requires 8-65x fewer samples to achieve estimates within 1% of the ground truth, while simultaneously revealing more diverse failure cases under a stricter evaluation budget.","one_line_summary":"ProEval is a proactive framework using pre-trained GPs, Bayesian quadrature, and superlevel set sampling to estimate performance and find failures in generative AI with 8-65x fewer samples than baselines.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That pre-trained Gaussian Processes trained on prior model evaluations can accurately serve as surrogates for the performance score function on new models and inputs, enabling effective transfer and active selection without significant distribution shift.","pith_extraction_headline":"ProEval uses pre-trained Gaussian Processes as surrogates to estimate generative AI performance accurately with 8-65 times fewer samples while finding more failures."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2604.23099/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"ai_meta_artifact","ran_at":"2026-05-21T09:39:48.515648Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T23:25:57.629052Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"497dfeb80ef1c782d9d349eb021da0491ed54fd00ff20210c25a7d0edea0e9bb"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}