
arxiv: 2604.01532 · v2 · submitted 2026-04-02 · 💻 cs.AI

PHMForge: Evaluating LLM Agents on Industrial Prognostics through MCP-Native, Algorithm-Grounded Tools

Pith reviewed 2026-05-13 22:00 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · prognostics · health management · MCP · benchmark · tool execution · RAG ablation · remaining useful life

The pith

LLM agents reach 80.8% pass@1 on industrial prognostics tasks when they can execute MCP tools; replacing tool execution with text retrieval drops battery remaining-useful-life pass-all-3 from 100% to 20%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PHMForge supplies 99 expert-written scenarios across eight asset classes and 39 tools that run published PHM algorithms directly through the Model Context Protocol (MCP). The best framework and model combination solves 80.8% of the tasks on the first attempt, with most remaining errors traced to incorrect ordering of tool calls rather than misuse of any single tool. An ablation that swaps live tool execution for text retrieval over equivalent telemetry evidence lowers the battery remaining-useful-life pass-all-3 rate from 100% to 20% (5 of 5 scenarios down to 1 of 5). The results indicate that agents require active algorithmic computation, not static lookup, to handle prognostic calculations reliably. The full environment, evaluators, and leaderboard are released publicly.

Core claim

PHMForge shows that LLM agents achieve 80.8% pass@1 on 99 SME-authored PHM scenarios when they can invoke 39 MCP-native wrappers around published algorithms such as C-MAPSS and Arrhenius models. Replacing those executable calls with text-based RAG over equivalent evidence reduces Remaining Useful Life pass-all-3 on the battery class from 5/5 to 1/5, establishing that dynamic tool execution is necessary for accurate prognostic reasoning while orchestration and sequencing remain the dominant failure modes.
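To make the two headline metrics concrete, here is a minimal Python sketch. The paper's exact scoring rules are not reproduced on this page, so the definitions are assumptions: pass@1 is taken as the fraction of scenarios solved on a single first attempt, and pass-all-3 as the fraction of scenarios that pass on every one of three independent runs. The function names and run records are hypothetical.

```python
from typing import Sequence


def pass_at_1(first_attempts: Sequence[bool]) -> float:
    """Fraction of scenarios whose single first attempt passed (assumed definition)."""
    return sum(first_attempts) / len(first_attempts)


def pass_all_k(runs_per_scenario: Sequence[Sequence[bool]], k: int = 3) -> float:
    """Fraction of scenarios that passed on all k runs (assumed definition)."""
    for runs in runs_per_scenario:
        if len(runs) != k:
            raise ValueError(f"expected {k} runs per scenario, got {len(runs)}")
    return sum(all(runs) for runs in runs_per_scenario) / len(runs_per_scenario)


# Illustrative records only, chosen to match the reported ratios.
mcp_runs = [[True, True, True]] * 5                           # 5 of 5 pass all runs
rag_runs = [[True, True, True]] + [[True, False, True]] * 4   # 1 of 5 passes all runs

print(pass_all_k(mcp_runs))                              # 1.0
print(pass_all_k(rag_runs))                              # 0.2
print(round(pass_at_1([True] * 80 + [False] * 19), 3))   # 0.808
```

The last line shows that 80 first-attempt passes out of 99 scenarios rounds to 80.8%, which is consistent with the headline figure, although the raw count is not stated on this page.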

What carries the argument

MCP-native tool wrappers that directly execute published PHM algorithms on asset telemetry data.
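As an illustration of the kind of computation such a wrapper might perform, the sketch below implements a generic Arrhenius-type capacity-fade model and inverts it for remaining useful life. It is not PHMForge's code: the power-law form Q_loss = a * exp(-Ea / (R * T)) * t^z, the parameter values, the 20% end-of-life threshold, and the function names are all assumptions chosen for illustration.

```python
import math

R_GAS = 8.314  # universal gas constant, J/(mol*K)


def capacity_loss(t_days: float, temp_k: float,
                  a: float = 900.0, ea: float = 3.0e4, z: float = 0.5) -> float:
    """Fractional capacity loss after t_days at temperature temp_k.

    Assumed form: a * exp(-ea / (R * T)) * t**z, with hypothetical parameters.
    """
    return a * math.exp(-ea / (R_GAS * temp_k)) * t_days ** z


def remaining_useful_life(t_days: float, temp_k: float,
                          eol_loss: float = 0.20, **params: float) -> float:
    """Days left until capacity loss reaches eol_loss (closed-form inversion)."""
    z = params.get("z", 0.5)
    rate = capacity_loss(1.0, temp_k, **params)  # loss at t = 1 day is the rate constant
    t_eol = (eol_loss / rate) ** (1.0 / z)       # solve rate * t**z = eol_loss for t
    return max(t_eol - t_days, 0.0)


# A cell aged 100 days, evaluated at 25 C and at 45 C (temperatures in kelvin).
print(remaining_useful_life(100.0, 298.15))
print(remaining_useful_life(100.0, 318.15))
```

Under these assumptions the hotter cell reaches end of life sooner. The answer comes from evaluating and inverting a formula on the specific inputs, which is the step a text-retrieval pipeline has no direct way to perform.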

If this is right

  • Orchestration and tool-sequencing errors account for the majority of failures across frameworks and backbones.
  • Frontier LLMs call individual tools correctly more often than they plan the right sequence of calls.
  • Smaller open-weight models produce more schema-invalid tool calls than larger models.
  • Replacing MCP execution with text RAG causes large accuracy losses on computation-heavy tasks such as remaining useful life estimation.
  • Deterministic evaluators on public datasets enable reproducible comparisons of agent frameworks.
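The failure modes in the list above can be told apart mechanically from a logged trajectory. The following is a minimal sketch, not the paper's trajectory decomposition: the tool names, the required-argument schemas, the expected call order, and the three labels are hypothetical. The call records mirror the `name` and `arguments` fields of an MCP `tools/call` request.

```python
from typing import Any

# Hypothetical tool schemas: required argument names and their Python types.
TOOL_SCHEMAS: dict[str, dict[str, type]] = {
    "load_telemetry": {"asset_id": str},
    "extract_features": {"window": int},
    "estimate_rul": {"model": str},
}

# Hypothetical gold ordering of tool calls for one scenario.
EXPECTED_ORDER = ["load_telemetry", "extract_features", "estimate_rul"]


def schema_valid(call: dict[str, Any]) -> bool:
    """True if the call names a known tool and every required argument has the right type."""
    schema = TOOL_SCHEMAS.get(call.get("name"))
    if schema is None:
        return False
    args = call.get("arguments", {})
    return all(isinstance(args.get(key), typ) for key, typ in schema.items())


def is_subsequence(expected: list[str], observed: list[str]) -> bool:
    """True if expected occurs within observed in order (extra calls are allowed)."""
    remaining = iter(observed)
    return all(name in remaining for name in expected)


def classify(trajectory: list[dict[str, Any]]) -> str:
    """Label a trajectory as schema_invalid, orchestration_error, or ok."""
    if not all(schema_valid(call) for call in trajectory):
        return "schema_invalid"
    names = [call["name"] for call in trajectory]
    if not is_subsequence(EXPECTED_ORDER, names):
        return "orchestration_error"
    return "ok"


good = [
    {"name": "load_telemetry", "arguments": {"asset_id": "battery_01"}},
    {"name": "extract_features", "arguments": {"window": 50}},
    {"name": "estimate_rul", "arguments": {"model": "arrhenius"}},
]
print(classify(good))                            # ok
print(classify([good[1], good[0], good[2]]))     # orchestration_error
```

Checking schemas before ordering means a malformed call is never counted as a sequencing error, which keeps the two categories disjoint in this sketch.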

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Reliable agents for technical domains will need separate planning components that operate alongside tool invocation rather than relying on the language model alone.
  • The public release of scenarios and tools allows direct testing of whether targeted orchestration improvements can close the remaining gap to near-perfect reliability.
  • Similar native-tool benchmarks could be constructed for other fields that require agents to run precise calculations on sensor streams.
  • Strong results on public datasets such as NASA PCoE suggest the same architecture could transfer to proprietary industrial settings once orchestration improves.

Load-bearing premise

The 99 expert-authored scenarios and 39 tool wrappers faithfully represent real industrial PHM challenges and correctly implement the cited algorithms without selection bias or implementation artifacts.

What would settle it

A new evaluation on additional PHM scenarios or alternative published algorithms that yields pass@1 rates below 60% for the same agent configurations would show the reported performance does not generalize.

Figures

Figures reproduced from arXiv: 2604.01532 by Ayan Das, Chun-Yi Tsai, Dhaval Patel, Kaoutar El Maghraoui, Shuxin Lin, Tianjun Feng, Yihan Sun, Yunfeng Chen.

Figure 1: PHMForge benchmark results across framework-model combinations. Even top configurations (Claude Code + Sonnet […]
Figure 2: PHMForge Architecture. Industrial stakeholder queries are processed by agentic frameworks and LLM backbones, […]
Figure 3: Process behind how we expanded our scenarios, choice of datasets and corresponding tasks associated with the […]
Figure 4: Scenario Generation Pipeline examples of open-ended ground truth and 1 example of close-ended ground truth. As shown in […]
Figure 5: Ground truth structure for open-ended (RUL Prediction, Fault Classification) and close-ended (Engine Health Analysis) […]
Original abstract

LLM agents are beginning to invoke industrial asset-management tools through the Model Context Protocol (MCP), yet whether they can act reliably on this substrate for safety-critical Prognostics and Health Management (PHM) is unanswered. Prior benchmarks conflate protocol fluency with reasoning, instrumentation failures with agent failures, and tool use with tool retrieval. We introduce PHMForge, an evaluation environment that closes each conflation. PHMForge ships 99 SME-authored scenarios across eight industrial asset classes spanning rotating equipment, aero-engines, and lithium-ion cells, on public datasets including NASA PCoE, served through 39 MCP-native tools wrapping published PHM algorithms (C-MAPSS, ISO 10816, Arrhenius capacity-fade models, time-series foundation models). Krippendorff's α ∈ [0.74, 0.82] on a 30-scenario stratified rotating-equipment/aero-engine sample; the battery extension is single-rater. Across three agentic frameworks and six LLM backbones, the strongest configuration reaches 80.8% pass@1, with the residual gap concentrated in orchestration and tool-sequencing errors. Crucially, an architectural ablation shows that replacing MCP execution with text-based Retrieval-Augmented Generation (RAG) over telemetry-equivalent evidence collapses Remaining Useful Life pass-all-3 from 100% to 20% (5/5 vs. 1/5) on the battery class, exposing the structural limits of static retrieval for prognostic computation. Trajectory decomposition shows orchestration errors dominate failures across backbones, while schema-invalid tool calls concentrate in smaller open-weight models. Frontier LLMs are stronger at calling tools than at planning when to call them. PHMForge is open-sourced with deterministic evaluators, a public leaderboard, and a datasheet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces PHMForge, an evaluation environment for LLM agents on industrial Prognostics and Health Management (PHM) tasks. It supplies 99 SME-authored scenarios across eight asset classes (rotating equipment, aero-engines, lithium-ion cells) served by 39 MCP-native tools that wrap published algorithms (C-MAPSS, ISO 10816, Arrhenius models, time-series foundation models). Across three agentic frameworks and six LLM backbones the best configuration reaches 80.8% pass@1, with residual failures concentrated in orchestration and tool sequencing. An architectural ablation shows that replacing MCP execution with text-based RAG over telemetry-equivalent evidence drops Remaining Useful Life pass-all-3 from 100% (5/5) to 20% (1/5) on the battery class. Krippendorff’s α is reported in [0.74, 0.82] for a 30-scenario rotating/aero sample; the battery extension is single-rater. The benchmark, deterministic evaluators, and public leaderboard are open-sourced.

Significance. If the central empirical claims hold, the work supplies a reproducible, algorithm-grounded benchmark that isolates protocol fluency from reasoning and tool execution from retrieval. The open release of scenarios, wrappers, evaluators, and leaderboard is a clear strength that enables direct follow-up. The ablation result, if robust, would indicate that static retrieval is structurally insufficient for dynamic prognostic computation, a finding with direct implications for agent design in safety-critical domains.

major comments (1)
  1. [Architectural ablation (results section)] The architectural ablation claiming that MCP execution is required for reliable RUL prediction (pass-all-3 falling from 100% to 20%) rests exclusively on n=5 battery scenarios evaluated by a single rater. Krippendorff’s α is reported only for the 30-scenario rotating/aero sample; no multi-rater or larger-n validation is provided for the battery class. This small sample is load-bearing for the structural-limit conclusion yet leaves open sampling variability, scenario selection effects, or rater-specific judgment as alternative explanations.
minor comments (2)
  1. [Evaluation metrics] Clarify the precise definition and computation of “pass-all-3” versus pass@1, including how partial successes are scored.
  2. [Benchmark construction] Add a brief description of scenario construction, inclusion/exclusion criteria, and any steps taken to mitigate selection bias in the 99 scenarios.
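To gauge how much weight n=5 can bear in the major comment, the sketch below computes Wilson 95% score intervals for the two pass rates and a two-sided Fisher exact test on the 2×2 table (5 pass, 0 fail versus 1 pass, 4 fail). This is an editorial back-of-envelope check using only the counts quoted above, not an analysis from the paper. It treats the two arms as independent samples, which is an assumption, since the ablation presumably reuses the same five scenarios.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x: int) -> float:
        # Hypergeometric probability of x successes in the first row.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables at least as unlikely as the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


print(wilson_interval(5, 5))                 # MCP arm, 5 of 5
print(wilson_interval(1, 5))                 # RAG arm, 1 of 5
print(fisher_exact_two_sided(5, 0, 1, 4))    # 10/210
```

The intervals come out at roughly [0.57, 1.00] for 5/5 and [0.04, 0.62] for 1/5, and the Fisher p-value is 10/210, about 0.048. The gap is therefore suggestive but sits close to the conventional threshold, which supports the referee's request for a larger battery sample.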

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the importance of robust validation for the architectural ablation. We address the single major comment below and will incorporate the suggested strengthening in the revision.

Point-by-point responses
  1. Referee: The architectural ablation claiming that MCP execution is required for reliable RUL prediction (pass-all-3 falling from 100% to 20%) rests exclusively on n=5 battery scenarios evaluated by a single rater. Krippendorff’s α is reported only for the 30-scenario rotating/aero sample; no multi-rater or larger-n validation is provided for the battery class. This small sample is load-bearing for the structural-limit conclusion yet leaves open sampling variability, scenario selection effects, or rater-specific judgment as alternative explanations.

    Authors: We agree that the ablation currently rests on only five battery scenarios with single-rater evaluation, while Krippendorff’s α is provided solely for the 30-scenario rotating-equipment/aero-engine sample. Although the pass-all-3 outcomes are produced by deterministic evaluators grounded in published algorithms (thereby limiting outcome subjectivity), we acknowledge that the small n leaves room for sampling variability. In the revised manuscript we will expand the battery ablation to at least 20 scenarios, obtain multi-rater annotations for this class, and report the corresponding Krippendorff’s α to strengthen the structural claim.

    Revision: yes
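For reference, Krippendorff's α compares the observed disagreement D_o with the disagreement D_e expected by chance: α = 1 − D_o / D_e. The sketch below implements the nominal-data variant from a coincidence matrix. The rating labels are invented for illustration, and this page does not say which variant (nominal, ordinal, interval) the authors used, so the nominal choice is an assumption.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units: list[list[str]]) -> float:
    """Krippendorff's alpha for nominal labels.

    units: one inner list per rated item, holding the labels from the raters
    who rated it. Items with fewer than two labels are skipped.
    """
    coincidences: Counter = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single rating carries no agreement information
        # Every ordered pair of ratings from different raters, weighted by 1/(m-1).
        for c, k in permutations(labels, 2):
            coincidences[(c, k)] += 1.0 / (m - 1)

    totals: Counter = Counter()
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k]
                   for c in totals for k in totals if c != k) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - observed / expected


# Two raters, four items, one disagreement (invented labels).
ratings = [["pass", "pass"], ["pass", "pass"], ["fail", "fail"], ["pass", "fail"]]
print(round(krippendorff_alpha_nominal(ratings), 4))  # 0.5333
```

With a single rater every item has only one label, all items are skipped, and the function raises an error. That is the practical reason no α can be reported for the battery class until a second rater is added.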

Circularity Check

0 steps flagged

No circularity: empirical benchmark on external datasets and published algorithms

Full rationale

The paper presents an empirical evaluation framework (PHMForge) with 99 SME-authored scenarios and 39 tool wrappers around independently published PHM algorithms (C-MAPSS, ISO 10816, Arrhenius models, etc.). All reported metrics (80.8% pass@1, RUL pass-all-3 rates, ablation results) are direct observational measurements on public datasets; no equations, fitted parameters, or predictions are derived from the paper's own outputs. No self-definitional loops, fitted-input-as-prediction, or load-bearing self-citation chains appear in the derivation chain. The MCP-vs-RAG ablation is a controlled comparison on fixed scenarios, not a reduction to internal construction. This is a standard honest empirical result with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark validity rests on two domain assumptions about scenario representativeness and tool correctness; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: SME-authored scenarios accurately represent real industrial PHM challenges across asset classes.
    Invoked to justify the 99 scenarios as the evaluation substrate.
  • domain assumption: The 39 MCP-native tools correctly wrap and execute the published PHM algorithms.
    Central to claiming the evaluation measures genuine prognostic reasoning rather than tool artifacts.


