PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.
Title resolution pending
11 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
Presents TRUST-Bench benchmark for hidden-trigger tool compromises in LLM agents and VISTA-Guard framework for trajectory-aware risk scoring of final actions under untrusted feedback.
RS-Claw enables remote sensing agents to actively explore tools via hierarchical skill trees, achieving up to 86% token compression and outperforming flat registration and RAG baselines on Earth-Bench.
CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.
EduAgentBench is a new source-grounded benchmark that evaluates tutor agents across pedagogical judgment, situated multi-turn tutoring, and Canvas-style workflow completion, finding frontier models capable of basic judgment but inadequate for professional teaching standards.
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque grading misses 44% of safety issues.
MM-ToolBench introduces 100 closed-loop multimodal tasks across two domains with 27 MCP servers and 324 tools, where agents must execute, inspect artifacts, and revise before final output.
Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.
GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.
DeepSeek-V3.2 adds sparse attention, scaled RL post-training, and large-scale agentic data synthesis to reach GPT-5-level performance and gold medals in 2025 IMO and IOI with its high-compute variant.
citing papers explorer
-
PhysicianBench: Evaluating LLM Agents in Real-World EHR Environments
PhysicianBench is a new benchmark of 100 physician-reviewed, execution-grounded tasks in live EHR environments where the best LLM agent reaches only 46% success and open-source models reach 19%.
-
Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback
Presents TRUST-Bench benchmark for hidden-trigger tool compromises in LLM agents and VISTA-Guard framework for trajectory-aware risk scoring of final actions under untrusted feedback.
-
RS-Claw: Progressive Active Tool Exploration via Hierarchical Skill Trees for Remote Sensing Agents
RS-Claw enables remote sensing agents to actively explore tools via hierarchical skill trees, achieving up to 86% token compression and outperforming flat registration and RAG baselines on Earth-Bench.
-
CHI-Bench: Can AI Agents Automate End-to-End, Long-Horizon, Policy-Rich Healthcare Workflows?
CHI-Bench shows current AI agents achieve at most 28% success on long-horizon healthcare workflows that require dense policy adherence, multi-role handoffs, and multi-turn interactions.
-
Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows
EduAgentBench is a new source-grounded benchmark that evaluates tutor agents across pedagogical judgment, situated multi-turn tutoring, and Canvas-style workflow completion, finding frontier models capable of basic judgment but inadequate for professional teaching standards.
-
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
-
Claw-Eval: Towards Trustworthy Evaluation of Autonomous Agents
Claw-Eval is a new trajectory-aware benchmark for LLM agents that records execution traces, audit logs, and environment snapshots to evaluate completion, safety, and robustness across 300 tasks, revealing that opaque grading misses 44% of safety issues.
-
TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents
MM-ToolBench introduces 100 closed-loop multimodal tasks across two domains with 27 MCP servers and 324 tools, where agents must execute, inspect artifacts, and revise before final output.
-
Irminsul: MLA-Native Position-Independent Caching for Agentic LLM Serving
Irminsul recovers up to 83% of prompt tokens above exact-prefix matching and delivers 63% prefill energy savings per cache hit on MLA-MoE models by content-hashing CDC chunks and applying closed-form kr correction.
-
GLM-5: from Vibe Coding to Agentic Engineering
GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.
-
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
DeepSeek-V3.2 adds sparse attention, scaled RL post-training, and large-scale agentic data synthesis to reach GPT-5-level performance and gold medals in 2025 IMO and IOI with its high-compute variant.