AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance, 2025
6 Pith papers cite this work.
fields: cs.AI (6)
years: 2026 (6)
citing papers explorer
- DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
  DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deployment issue.
- PHMForge: Evaluating LLM Agents on Industrial Prognostics through MCP-Native, Algorithm-Grounded Tools
  PHMForge benchmark shows LLM agents achieve 80.8% pass@1 on prognostic tasks with native MCP tools but performance collapses from 100% to 20% when using text RAG instead.
- IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs
  IndustryBench is a standards-grounded Chinese benchmark that exposes LLMs' persistent gaps in industrial terminology, safety compliance, and parameter accuracy, with safety checks reshuffling model rankings.
- SPIN: Structural LLM Planning via Iterative Navigation for Industrial Tasks
  SPIN enforces DAG-valid plans and prefix-based stopping for LLM agents, cutting executed tasks from 1061 to 623 and tool calls from 11.81 to 6.82 per run on AssetOpsBench while raising success from 0.638 to 0.706.
- Results and Retrospective Analysis of the CODS 2025 AssetOpsBench Challenge
  Retrospective of a 2025 AI agent competition finds public-private score misalignment, an inert composite component, multi-account registrations, and guardrail fixes outperforming architectural novelty.
- Foundation-Model-Based Agents in Industrial Automation: Purposes, Capabilities, and Open Challenges
  A literature survey finds foundation-model agents in industry are 75% at prototype stages with gains in human interaction and uncertainty handling but deficits in negotiation, plus limitations like hallucinations and latency.
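The SPIN summary above quotes before-and-after figures without stating the relative change. As a reading aid, the sketch below derives those ratios from the quoted numbers only. The helper name `relative_change` and the dictionary labels are illustrative assumptions and do not come from the SPIN paper.

```python
def relative_change(before: float, after: float) -> float:
    """Return the signed relative change from `before` to `after` as a fraction."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before


# Figures copied from the SPIN summary; the labels are illustrative.
spin_figures = {
    "executed tasks": (1061, 623),
    "tool calls per run": (11.81, 6.82),
    "success rate": (0.638, 0.706),
}

for name, (before, after) in spin_figures.items():
    # "+.1%" prints a signed percentage with one decimal place.
    print(f"{name}: {relative_change(before, after):+.1%}")
```

On these figures, executed tasks fall by roughly 41%, tool calls per run by roughly 42%, and the success rate rises by about 11% in relative terms (0.068 absolute).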