OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.
$OneMillion-Bench: How far are language agents from human experts?arXiv preprint arXiv:2603.07980
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4verdicts
UNVERDICTED 4roles
background 1polarities
background 1representative citing papers
Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque grading misses 44% of safety issues.
AlphaEval is a benchmark of 94 production-sourced tasks from seven companies for evaluating full AI agent products across six domains using multiple judgment methods, plus a framework to build similar benchmarks.
SemanticQA unifies prior multiword expression datasets into a benchmark that reveals substantial performance variation among language models on semantic reasoning tasks.
citing papers explorer
-
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.
-
Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents
Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque grading misses 44% of safety issues.
-
AlphaEval: Evaluating Agents in Production
AlphaEval is a benchmark of 94 production-sourced tasks from seven companies for evaluating full AI agent products across six domains using multiple judgment methods, plus a framework to build similar benchmarks.
-
Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models
SemanticQA unifies prior multiword expression datasets into a benchmark that reveals substantial performance variation among language models on semantic reasoning tasks.