OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.
Mcpmark: A benchmark for stress-testing realistic and comprehensive mcp use
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 7roles
background 1polarities
background 1representative citing papers
Presents TRUST-Bench benchmark for hidden-trigger tool compromises in LLM agents and VISTA-Guard framework for trajectory-aware risk scoring of final actions under untrusted feedback.
ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.
AJ-Bench provides 155 tasks in three domains to evaluate environment-interacting agent judges, showing performance gains over LLM-as-a-Judge but exposing remaining verification challenges.
MCP-DPT creates a defense-placement taxonomy that organizes MCP threats and defenses across six architectural layers, revealing mostly tool-centric protections and gaps at orchestration, transport, and supply-chain layers.
AgencyBench is a new benchmark with 138 tasks in 32 scenarios that measures autonomous agent performance on extended real-world problems using simulated feedback and sandboxed assessment.
Introduces Task2MCP dataset and T2MRec model for recommending MCP servers to LLM agents based on task semantics and engineering constraints.
citing papers explorer
-
OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation
OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.
-
Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback
Presents TRUST-Bench benchmark for hidden-trigger tool compromises in LLM agents and VISTA-Guard framework for trajectory-aware risk scoring of final actions under untrusted feedback.
-
ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.
-
AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation
AJ-Bench provides 155 tasks in three domains to evaluate environment-interacting agent judges, showing performance gains over LLM-as-a-Judge but exposing remaining verification challenges.
-
MCP-DPT: A Defense-Placement Taxonomy and Coverage Analysis for Model Context Protocol Security
MCP-DPT creates a defense-placement taxonomy that organizes MCP threats and defenses across six architectural layers, revealing mostly tool-centric protections and gaps at orchestration, transport, and supply-chain layers.
-
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts
AgencyBench is a new benchmark with 138 tasks in 32 scenarios that measures autonomous agent performance on extended real-world problems using simulated feedback and sandboxed assessment.
-
From Language to Action: Enhancing LLM Task Efficiency with Task-Aware MCP Server Recommendation
Introduces Task2MCP dataset and T2MRec model for recommending MCP servers to LLM agents based on task semantics and engineering constraints.