PHMForge: Evaluating LLM Agents on Industrial Prognostics through MCP-Native, Algorithm-Grounded Tools
Pith reviewed 2026-05-13 22:00 UTC · model grok-4.3
The pith
LLM agents reach 80.8% pass@1 on industrial prognostics tasks when they can call executable MCP tools. On battery Remaining Useful Life tasks, swapping tool execution for text retrieval cuts pass-all-3 from 100% to 20% (5/5 to 1/5).
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PHMForge reports that its strongest agent configuration achieves 80.8% pass@1 on 99 SME-authored PHM scenarios when agents can invoke 39 MCP-native tools wrapping published PHM methods such as C-MAPSS, ISO 10816, and Arrhenius capacity-fade models. Replacing those executable calls with text-based RAG over telemetry-equivalent evidence reduces Remaining Useful Life pass-all-3 on the battery class from 5/5 to 1/5. The paper reads this as evidence that dynamic tool execution is necessary for accurate prognostic computation, and it finds that orchestration and tool-sequencing errors remain the dominant failure modes.
What carries the argument
MCP-native tool wrappers that directly execute published PHM algorithms on asset telemetry data.
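To make the idea of an algorithm-grounded tool concrete, here is a minimal Python sketch of what a wrapper around an Arrhenius-type capacity-fade model might look like. It is not taken from PHMForge: the functional form (loss = A * exp(-Ea / (R * T)) * days ** z), the tool name `estimate_battery_rul`, the dict-based registry, and every parameter value are assumptions chosen for illustration, and no MCP SDK is used.

```python
import math

R_GAS = 8.314  # universal gas constant, J/(mol*K)

def arrhenius_capacity_fade(days, temp_kelvin, pre_factor, activation_energy, exponent):
    """Fractional capacity loss after `days` at constant temperature.

    Assumed form: loss = A * exp(-Ea / (R * T)) * days ** z.
    """
    rate = pre_factor * math.exp(-activation_energy / (R_GAS * temp_kelvin))
    return rate * days ** exponent

def remaining_useful_life(days_elapsed, temp_kelvin, pre_factor, activation_energy,
                          exponent, eol_loss=0.20):
    """Days left until the modelled loss reaches `eol_loss` (0.0 if already past)."""
    rate = pre_factor * math.exp(-activation_energy / (R_GAS * temp_kelvin))
    days_at_eol = (eol_loss / rate) ** (1.0 / exponent)
    return max(0.0, days_at_eol - days_elapsed)

# Hypothetical tool registry: maps a tool name to a callable, as an agent runtime might.
TOOLS = {"estimate_battery_rul": remaining_useful_life}

def call_tool(name, arguments):
    """Dispatch a tool call given as a name plus a dict of keyword arguments."""
    return TOOLS[name](**arguments)

# Illustrative parameters only; they are not fitted to any dataset.
PARAMS = dict(pre_factor=30.0, activation_energy=24_500.0, exponent=0.55)

rul_25c = call_tool("estimate_battery_rul",
                    dict(days_elapsed=200.0, temp_kelvin=298.15, **PARAMS))
rul_45c = call_tool("estimate_battery_rul",
                    dict(days_elapsed=200.0, temp_kelvin=318.15, **PARAMS))
print(f"RUL at 25 C: {rul_25c:.0f} days, at 45 C: {rul_45c:.0f} days")
```

The answer depends on an exponential and a fractional power of the inputs, which is the kind of computation a tool call performs exactly and a text-retrieval pipeline can only approximate from whatever passages it finds.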
If this is right
- Orchestration and tool-sequencing errors account for the majority of failures across frameworks and backbones.
- Frontier LLMs call individual tools correctly more often than they plan the right sequence of calls.
- Smaller open-weight models produce more schema-invalid tool calls than larger models.
- Replacing MCP execution with text RAG causes large accuracy losses on computation-heavy tasks such as remaining useful life estimation.
- Deterministic evaluators on public datasets enable reproducible comparisons of agent frameworks.
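The distinction between schema-invalid calls and orchestration errors in the list above can be sketched in a few lines of Python. This is an illustrative classifier, not the paper's trajectory-decomposition procedure: the tool names, the type-based schemas, and the rule that any deviation from an expected call sequence counts as an orchestration error are all assumptions.

```python
# Hypothetical tool schemas: required argument names and their Python types.
TOOL_SCHEMAS = {
    "load_telemetry": {"asset_id": str},
    "estimate_rul": {"asset_id": str, "horizon_cycles": int},
}

def is_schema_valid(call):
    """True if the call names a known tool and supplies exactly the typed arguments."""
    schema = TOOL_SCHEMAS.get(call.get("tool"))
    args = call.get("arguments")
    if schema is None or not isinstance(args, dict) or set(args) != set(schema):
        return False
    return all(isinstance(args[name], kind) for name, kind in schema.items())

def classify_trajectory(calls, expected_sequence):
    """Label a trajectory: schema errors first, then sequencing, else pass."""
    if not all(is_schema_valid(c) for c in calls):
        return "schema_invalid"
    if [c["tool"] for c in calls] != expected_sequence:
        return "orchestration_error"
    return "pass"

load = {"tool": "load_telemetry", "arguments": {"asset_id": "cell-07"}}
rul = {"tool": "estimate_rul", "arguments": {"asset_id": "cell-07", "horizon_cycles": 100}}
bad_rul = {"tool": "estimate_rul", "arguments": {"asset_id": "cell-07", "horizon_cycles": "100"}}
expected = ["load_telemetry", "estimate_rul"]

print(classify_trajectory([load, rul], expected))      # pass
print(classify_trajectory([rul, load], expected))      # orchestration_error
print(classify_trajectory([load, bad_rul], expected))  # schema_invalid
```

In this sketch every call in the second trajectory is individually well formed, yet the trajectory still fails, which mirrors the observation that calling tools correctly is easier than sequencing them.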
Where Pith is reading between the lines
- Reliable agents for technical domains will need separate planning components that operate alongside tool invocation rather than relying on the language model alone.
- The public release of scenarios and tools allows direct testing of whether targeted orchestration improvements can close the remaining gap to near-perfect reliability.
- Similar native-tool benchmarks could be constructed for other fields that require agents to run precise calculations on sensor streams.
- Strong results on public datasets such as NASA PCoE suggest the same architecture could transfer to proprietary industrial settings once orchestration improves.
Load-bearing premise
The 99 expert-authored scenarios and 39 tool wrappers faithfully represent real industrial PHM challenges and correctly implement the cited algorithms without selection bias or implementation artifacts.
What would settle it
A new evaluation on additional PHM scenarios or alternative published algorithms that yields pass@1 rates below 60% for the same agent configurations would show the reported performance does not generalize.
Original abstract
LLM agents are beginning to invoke industrial asset-management tools through the Model Context Protocol (MCP), yet whether they can act reliably on this substrate for safety-critical Prognostics and Health Management (PHM) is unanswered. Prior benchmarks conflate protocol fluency with reasoning, instrumentation failures with agent failures, and tool use with tool retrieval. We introduce PHMForge, an evaluation environment that closes each conflation. PHMForge ships 99 SME-authored scenarios across eight industrial asset classes spanning rotating equipment, aero-engines, and lithium-ion cells, on public datasets including NASA PCoE, served through 39 MCP-native tools wrapping published PHM algorithms (C-MAPSS, ISO 10816, Arrhenius capacity-fade models, time-series foundation models). Inter-rater agreement is Krippendorff's α ∈ [0.74, 0.82] on a 30-scenario stratified rotating-equipment/aero-engine sample; the battery extension is single-rater. Across three agentic frameworks and six LLM backbones, the strongest configuration reaches 80.8% pass@1, with the residual gap concentrated in orchestration and tool-sequencing errors. Crucially, an architectural ablation shows that replacing MCP execution with text-based Retrieval-Augmented Generation (RAG) over telemetry-equivalent evidence collapses Remaining Useful Life pass-all-3 from 100% to 20% (5/5 vs. 1/5) on the battery class, exposing the structural limits of static retrieval for prognostic computation. Trajectory decomposition shows orchestration errors dominate failures across backbones, while schema-invalid tool calls concentrate in smaller open-weight models. Frontier LLMs are stronger at calling tools than at planning when to call them. PHMForge is open-sourced with deterministic evaluators, a public leaderboard, and a datasheet.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PHMForge, an evaluation environment for LLM agents on industrial Prognostics and Health Management (PHM) tasks. It supplies 99 SME-authored scenarios across eight asset classes (rotating equipment, aero-engines, lithium-ion cells) served by 39 MCP-native tools that wrap published algorithms (C-MAPSS, ISO 10816, Arrhenius models, time-series foundation models). Across three agentic frameworks and six LLM backbones the best configuration reaches 80.8% pass@1, with residual failures concentrated in orchestration and tool sequencing. An architectural ablation shows that replacing MCP execution with text-based RAG over telemetry-equivalent evidence drops Remaining Useful Life pass-all-3 from 100% (5/5) to 20% (1/5) on the battery class. Krippendorff’s α is reported in [0.74, 0.82] for a 30-scenario rotating/aero sample; the battery extension is single-rater. The benchmark, deterministic evaluators, and public leaderboard are open-sourced.
Significance. If the central empirical claims hold, the work supplies a reproducible, algorithm-grounded benchmark that isolates protocol fluency from reasoning and tool execution from retrieval. The open release of scenarios, wrappers, evaluators, and leaderboard is a clear strength that enables direct follow-up. The ablation result, if robust, would indicate that static retrieval is structurally insufficient for dynamic prognostic computation, a finding with direct implications for agent design in safety-critical domains.
Major comments (1)
- [Architectural ablation (results section)] The architectural ablation claiming that MCP execution is required for reliable RUL prediction (pass-all-3 falling from 100% to 20%) rests exclusively on n=5 battery scenarios evaluated by a single rater. Krippendorff’s α is reported only for the 30-scenario rotating/aero sample; no multi-rater or larger-n validation is provided for the battery class. This small sample is load-bearing for the structural-limit conclusion yet leaves open sampling variability, scenario selection effects, or rater-specific judgment as alternative explanations.
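The sampling-variability concern can be quantified with a short calculation. The Python sketch below runs Fisher's exact test on the reported 5/5 versus 1/5 counts; it is an added illustration rather than an analysis from the paper or the referee, and it assumes the two arms can be treated as independent samples. If the same five scenarios were used in both arms, a paired test would be the more appropriate choice.

```python
from math import comb

def fisher_exact_2x2(pass_a, n_a, pass_b, n_b):
    """Return (one_sided_p, two_sided_p) for pass counts in two groups.

    one_sided_p: probability that group A passes at least `pass_a` times
    given the margins. two_sided_p: total probability of all tables no more
    likely than the observed one.
    """
    total_pass = pass_a + pass_b
    lo = max(0, total_pass - n_b)
    hi = min(n_a, total_pass)
    # Integer hypergeometric weights avoid floating-point ties.
    weights = {k: comb(n_a, k) * comb(n_b, total_pass - k) for k in range(lo, hi + 1)}
    denom = sum(weights.values())
    observed = weights[pass_a]
    one_sided = sum(w for k, w in weights.items() if k >= pass_a) / denom
    two_sided = sum(w for w in weights.values() if w <= observed) / denom
    return one_sided, two_sided

# MCP arm: 5 of 5 pass; RAG arm: 1 of 5 pass (counts from the abstract).
one_sided, two_sided = fisher_exact_2x2(5, 5, 1, 5)
print(f"one-sided p = {one_sided:.4f}, two-sided p = {two_sided:.4f}")
# prints: one-sided p = 0.0238, two-sided p = 0.0476
```

Under these assumptions the difference sits close to the conventional 0.05 threshold on the two-sided test, and a single scenario flipping in either arm would move it past that threshold, which is consistent with the referee's request for a larger sample.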
Minor comments (2)
- [Evaluation metrics] Clarify the precise definition and computation of “pass-all-3” versus pass@1, including how partial successes are scored.
- [Benchmark construction] Add a brief description of scenario construction, inclusion/exclusion criteria, and any steps taken to mitigate selection bias in the 99 scenarios.
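Since the reviewed text does not define the two metrics, the Python sketch below shows one plausible reading, purely as an assumption: pass@1 is the fraction of scenarios whose first run passes, and pass-all-3 is the fraction of scenarios whose three independent runs all pass. The paper may define pass-all-3 differently (for example, as passing three separate criteria within one run), and the scenario results here are made up for illustration.

```python
def pass_at_1(runs_by_scenario):
    """Fraction of scenarios whose first run passed."""
    return sum(runs[0] for runs in runs_by_scenario.values()) / len(runs_by_scenario)

def pass_all_k(runs_by_scenario, k=3):
    """Fraction of scenarios with at least k runs whose first k runs all passed."""
    hits = sum(len(runs) >= k and all(runs[:k]) for runs in runs_by_scenario.values())
    return hits / len(runs_by_scenario)

# Hypothetical outcomes: one list of pass/fail results per scenario, one entry per run.
results = {
    "s1": [True, True, True],
    "s2": [True, False, True],
    "s3": [True, True, True],
    "s4": [False, True, True],
    "s5": [True, True, True],
}

print(f"pass@1     = {pass_at_1(results):.2f}")   # 0.80
print(f"pass-all-3 = {pass_all_k(results):.2f}")  # 0.60
```

Under this reading pass-all-3 can never exceed pass@1, because a scenario that fails its first run cannot pass all three; stating the definition explicitly would let readers check whether the two reported numbers are comparable in this way.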
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for highlighting the importance of robust validation for the architectural ablation. We address the single major comment below and will incorporate the suggested strengthening in the revision.
Point-by-point responses
- Referee: The architectural ablation claiming that MCP execution is required for reliable RUL prediction (pass-all-3 falling from 100% to 20%) rests exclusively on n=5 battery scenarios evaluated by a single rater. Krippendorff’s α is reported only for the 30-scenario rotating/aero sample; no multi-rater or larger-n validation is provided for the battery class. This small sample is load-bearing for the structural-limit conclusion yet leaves open sampling variability, scenario selection effects, or rater-specific judgment as alternative explanations.
  Authors: We agree that the ablation currently rests on only five battery scenarios with single-rater evaluation, while Krippendorff’s α is provided solely for the 30-scenario rotating-equipment/aero-engine sample. Although the pass-all-3 outcomes are produced by deterministic evaluators grounded in published algorithms (thereby limiting outcome subjectivity), we acknowledge that the small n leaves room for sampling variability. In the revised manuscript we will expand the battery ablation to at least 20 scenarios, obtain multi-rater annotations for this class, and report the corresponding Krippendorff’s α to strengthen the structural claim.
  Revision: yes
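For readers unfamiliar with the agreement statistic discussed here, the following Python sketch computes Krippendorff's α for nominal labels from a coincidence matrix. It is a generic textbook-style implementation under the assumption of categorical ratings; the reviewed text does not say which rating scale or α variant the authors used, and the example labels are invented.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    `units` is a list of label lists, one per scenario; each inner list holds
    the labels assigned by the raters who scored that scenario.
    Raises ZeroDivisionError if only one label value ever occurs.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a unit rated once carries no agreement information
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    # Observed disagreement mass versus the mass expected by chance.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b) / (n - 1)
    return 1.0 - observed / expected

# Hypothetical ratings: two raters, four scenarios, one disagreement.
ratings = [["pass", "pass"], ["pass", "pass"], ["fail", "fail"], ["pass", "fail"]]
alpha = krippendorff_alpha_nominal(ratings)
print(f"alpha = {alpha:.3f}")  # alpha = 0.533
```

Note that scenarios scored by a single rater are skipped entirely, which is why no α can be reported for a single-rater class such as the battery extension until additional annotations are collected.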
Circularity Check
No circularity: empirical benchmark on external datasets and published algorithms
The paper presents an empirical evaluation framework (PHMForge) with 99 SME-authored scenarios and 39 tool wrappers around independently published PHM algorithms (C-MAPSS, ISO 10816, Arrhenius models, etc.). All reported metrics (80.8% pass@1, RUL pass-all-3 rates, ablation results) are direct observational measurements on public datasets; no equations, fitted parameters, or predictions are derived from the paper's own outputs. No self-definitional loops, fitted-input-as-prediction, or load-bearing self-citation chains appear in the derivation chain. The MCP-vs-RAG ablation is a controlled comparison on fixed scenarios, not a reduction to internal construction. This is a standard honest empirical result with no circularity.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: SME-authored scenarios accurately represent real industrial PHM challenges across asset classes.
- Domain assumption: The 39 MCP-native tools correctly wrap and execute the published PHM algorithms.