DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deployment issue.
Phm- bench: A domain-specific benchmarking framework for systematic evaluation of large models in prognostics and health management
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
baseline 1
method 1
citation-polarity summary
fields
cs.AI 2years
2026 2verdicts
UNVERDICTED 2representative citing papers
PHMForge benchmark shows LLM agents achieve 80.8% pass@1 on prognostic tasks with native MCP tools but performance collapses from 100% to 20% when using text RAG instead.
citing papers explorer
-
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deployment issue.
-
PHMForge: Evaluating LLM Agents on Industrial Prognostics through MCP-Native, Algorithm-Grounded Tools
PHMForge benchmark shows LLM agents achieve 80.8% pass@1 on prognostic tasks with native MCP tools but performance collapses from 100% to 20% when using text RAG instead.