PHMForge: Evaluating LLM Agents on Industrial Prognostics through MCP-Native, Algorithm-Grounded Tools

Ayan Das; Chun-Yi Tsai; Dhaval Patel; Kaoutar El Maghraoui; Shuxin Lin; Tianjun Feng; Yihan Sun; Yunfeng Chen

arXiv: 2604.01532 · v2 · submitted 2026-04-02 · cs.AI

Tianjun Feng, Yunfeng Chen, Chun-Yi Tsai, Yihan Sun, Ayan Das, Kaoutar El Maghraoui, Shuxin Lin, Dhaval Patel

Pith reviewed 2026-05-13 22:00 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM agents, prognostics, health management, MCP, benchmark, tool execution, RAG ablation, remaining useful life

The pith

LLM agents reach 80.8% pass@1 on industrial prognostics tasks with executable MCP tools; on battery remaining-useful-life tasks, swapping tool execution for text retrieval drops pass-all-3 from 100% to 20%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

PHMForge supplies 99 expert-written scenarios across eight asset classes and 39 tools that run published PHM algorithms directly through the Model Context Protocol. The strongest framework and model combination solves 80.8% of the tasks on the first attempt, with most remaining errors traced to incorrect ordering of tool calls rather than misuse of any single tool. An ablation that swaps live tool execution for text retrieval over equivalent telemetry evidence lowers battery remaining-useful-life pass-all-3 from 100% to 20% (5/5 to 1/5 scenarios). The results suggest that agents need active algorithmic computation, not static lookup, to handle prognostic calculations reliably, although the ablation covers only five scenarios. The full environment, evaluators, and leaderboard are released publicly.

Core claim

PHMForge shows that LLM agents achieve 80.8% pass@1 on 99 SME-authored PHM scenarios when they can invoke 39 MCP-native wrappers around published algorithms such as C-MAPSS and Arrhenius models. Replacing those executable calls with text-based RAG over equivalent evidence reduces Remaining Useful Life pass-all-3 on the battery class from 5/5 to 1/5, establishing that dynamic tool execution is necessary for accurate prognostic reasoning while orchestration and sequencing remain the dominant failure modes.
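The two headline metrics can be made concrete with a small sketch. The definitions here are assumptions, since this page does not spell them out: pass@1 is taken as the fraction of scenarios whose first trial passes, and pass-all-3 as the fraction of scenarios whose three trials all pass. The scenario identifiers and trial outcomes are hypothetical and only chosen to reproduce the reported 5/5 and 1/5 counts.

```python
from typing import Dict, List

Trials = Dict[str, List[bool]]  # scenario id -> pass/fail per independent trial


def pass_at_1(trials: Trials) -> float:
    """Fraction of scenarios whose first trial passed (assumed definition)."""
    return sum(1 for runs in trials.values() if runs[0]) / len(trials)


def pass_all_k(trials: Trials, k: int = 3) -> float:
    """Fraction of scenarios whose first k trials all passed (assumed definition)."""
    passed = sum(1 for runs in trials.values() if len(runs) >= k and all(runs[:k]))
    return passed / len(trials)


# Hypothetical outcomes for five battery RUL scenarios under each condition.
mcp_runs: Trials = {f"battery-{i}": [True, True, True] for i in range(1, 6)}
rag_runs: Trials = {
    "battery-1": [True, True, True],
    "battery-2": [True, False, True],
    "battery-3": [False, False, False],
    "battery-4": [False, True, False],
    "battery-5": [True, True, False],
}

mcp_score = pass_all_k(mcp_runs)  # 5 of 5 scenarios pass all three trials
rag_score = pass_all_k(rag_runs)  # 1 of 5 scenarios passes all three trials

# If 80.8% is a plain fraction of the 99 scenarios, it corresponds to 80 passes.
headline = round(100 * 80 / 99, 1)
```

Under these assumptions a scenario that passes two of three trials counts as a failure for pass-all-3, which makes it a stricter measure than pass@1.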

What carries the argument

MCP-native tool wrappers that directly execute published PHM algorithms on asset telemetry data.
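For readers unfamiliar with the protocol, the sketch below shows the general shape of a tool invocation. MCP clients call tools with a JSON-RPC 2.0 request whose method is `tools/call`, and each tool advertises an `inputSchema`. The tool name `estimate_battery_rul`, its arguments, and the simplified validator are hypothetical illustrations, not PHMForge's actual tool definitions.

```python
import json
from typing import Any, Dict, List

# Hypothetical tool descriptor, shaped like an entry in an MCP tool listing.
RUL_TOOL: Dict[str, Any] = {
    "name": "estimate_battery_rul",  # hypothetical name
    "inputSchema": {
        "type": "object",
        "properties": {
            "cell_id": {"type": "string"},
            "eol_capacity_fraction": {"type": "number"},
        },
        "required": ["cell_id"],
    },
}

_JSON_TYPES = {"string": str, "number": (int, float), "integer": int, "boolean": bool}


def schema_errors(tool: Dict[str, Any], arguments: Dict[str, Any]) -> List[str]:
    """Very small subset of JSON Schema checking: required keys and primitive types."""
    schema = tool["inputSchema"]
    errors = [
        f"missing required argument: {key}"
        for key in schema.get("required", [])
        if key not in arguments
    ]
    for key, value in arguments.items():
        spec = schema["properties"].get(key)
        if spec is None:
            errors.append(f"unknown argument: {key}")
            continue
        # bool is a subclass of int in Python, so reject it for non-boolean fields.
        bool_mismatch = isinstance(value, bool) and spec["type"] != "boolean"
        if bool_mismatch or not isinstance(value, _JSON_TYPES[spec["type"]]):
            errors.append(f"wrong type for {key}: expected {spec['type']}")
    return errors


def build_tools_call(request_id: int, tool: Dict[str, Any], arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Build the JSON-RPC 2.0 request an MCP client sends to invoke a tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool["name"], "arguments": arguments},
    }


request = build_tools_call(1, RUL_TOOL, {"cell_id": "cell-001", "eol_capacity_fraction": 0.8})
wire_message = json.dumps(request)
```

A call such as `schema_errors(RUL_TOOL, {"cell_id": 5})` returns a type error, which is the kind of failure the paper labels a schema-invalid tool call.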

If this is right

Orchestration and tool-sequencing errors account for the majority of failures across frameworks and backbones.
Frontier LLMs call individual tools correctly more often than they plan the right sequence of calls.
Smaller open-weight models produce more schema-invalid tool calls than larger models.
Replacing MCP execution with text RAG causes large accuracy losses on computation-heavy tasks such as remaining useful life estimation.
Deterministic evaluators on public datasets enable reproducible comparisons of agent frameworks.
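The distinction between sequencing errors and schema-invalid calls can be illustrated with a toy classifier over an agent trajectory. This is a sketch under stated assumptions: the three labels, the rule that a schema error takes precedence, and the tool names are all hypothetical, and the paper's actual trajectory decomposition is not described on this page.

```python
from typing import Iterable, List, Tuple

Call = Tuple[str, bool]  # (tool name, whether the arguments were schema-valid)


def is_subsequence(expected: Iterable[str], actual: Iterable[str]) -> bool:
    """True if the expected tool names appear in order within the actual calls."""
    remaining = iter(actual)
    return all(step in remaining for step in expected)


def classify_trajectory(expected_order: List[str], calls: List[Call]) -> str:
    """Label a trajectory as 'schema_invalid', 'orchestration', or 'ok' (assumed taxonomy)."""
    if any(not valid for _, valid in calls):
        return "schema_invalid"
    names = [name for name, _ in calls]
    if not is_subsequence(expected_order, names):
        return "orchestration"
    return "ok"


# Hypothetical reference plan for a remaining-useful-life scenario.
plan = ["load_telemetry", "extract_features", "estimate_rul"]

good = [("load_telemetry", True), ("extract_features", True), ("estimate_rul", True)]
wrong_order = [("load_telemetry", True), ("estimate_rul", True), ("extract_features", True)]
bad_args = [("load_telemetry", True), ("extract_features", False), ("estimate_rul", True)]

labels = [classify_trajectory(plan, t) for t in (good, wrong_order, bad_args)]
```

In this toy version every individual call in `wrong_order` is valid, yet the trajectory still fails, which mirrors the observation that models call tools correctly more often than they sequence them correctly.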

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Reliable agents for technical domains will need separate planning components that operate alongside tool invocation rather than relying on the language model alone.
The public release of scenarios and tools allows direct testing of whether targeted orchestration improvements can close the remaining gap to near-perfect reliability.
Similar native-tool benchmarks could be constructed for other fields that require agents to run precise calculations on sensor streams.
Strong results on public datasets such as NASA PCoE suggest the same architecture could transfer to proprietary industrial settings once orchestration improves.

Load-bearing premise

The 99 expert-authored scenarios and 39 tool wrappers faithfully represent real industrial PHM challenges and correctly implement the cited algorithms without selection bias or implementation artifacts.

What would settle it

A new evaluation on additional PHM scenarios or alternative published algorithms that yields pass@1 rates below 60% for the same agent configurations would show the reported performance does not generalize.

Figures

Figures reproduced from arXiv: 2604.01532 by Ayan Das, Chun-Yi Tsai, Dhaval Patel, Kaoutar El Maghraoui, Shuxin Lin, Tianjun Feng, Yihan Sun, Yunfeng Chen.

**Figure 1.** PHMForge benchmark results across framework-model combinations. Even top configurations (Claude Code + Sonnet ...

**Figure 2.** PHMForge Architecture. Industrial stakeholder queries are processed by agentic frameworks and LLM backbones, ...

**Figure 3.** Process behind how we expanded our scenarios, choice of datasets and corresponding tasks associated with the ...

**Figure 4.** Scenario Generation Pipeline examples of open-ended ground truth and 1 example of close-ended ground truth. As shown in ...

**Figure 5.** Ground truth structure for open-ended (RUL Prediction, Fault Classification) and close-ended (Engine Health Analysis) ...

Original abstract

LLM agents are beginning to invoke industrial asset-management tools through the Model Context Protocol (MCP), yet whether they can act reliably on this substrate for safety-critical Prognostics and Health Management (PHM) is unanswered. Prior benchmarks conflate protocol fluency with reasoning, instrumentation failures with agent failures, and tool use with tool retrieval. We introduce PHMForge, an evaluation environment that closes each conflation. PHMForge ships 99 SME-authored scenarios across eight industrial asset classes spanning rotating equipment, aero-engines, and lithium-ion cells, on public datasets including NASA PCoE, served through 39 MCP-native tools wrapping published PHM algorithms (C-MAPSS, ISO 10816, Arrhenius capacity-fade models, time-series foundation models). Krippendorff's α ∈ [0.74, 0.82] on a 30-scenario stratified rotating-equipment/aero-engine sample; the battery extension is single-rater. Across three agentic frameworks and six LLM backbones, the strongest configuration reaches 80.8% pass@1, with the residual gap concentrated in orchestration and tool-sequencing errors. Crucially, an architectural ablation shows that replacing MCP execution with text-based Retrieval-Augmented Generation (RAG) over telemetry-equivalent evidence collapses Remaining Useful Life pass-all-3 from 100% to 20% (5/5 vs. 1/5) on the battery class, exposing the structural limits of static retrieval for prognostic computation. Trajectory decomposition shows orchestration errors dominate failures across backbones, while schema-invalid tool calls concentrate in smaller open-weight models. Frontier LLMs are stronger at calling tools than at planning when to call them. PHMForge is open-sourced with deterministic evaluators, a public leaderboard, and a datasheet.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PHMForge gives a new grounded benchmark for LLM agents on industrial prognostics but its key claim about MCP versus RAG rests on five single-rater battery scenarios.

The paper introduces PHMForge, a benchmark with 99 SME-authored scenarios across eight asset classes and 39 MCP-native tools that wrap published algorithms such as C-MAPSS, ISO 10816, and Arrhenius capacity-fade models on public datasets like NASA PCoE. It separates protocol fluency from reasoning failures and reports 80.8% pass@1 on the strongest agent-LLM combination, with most errors falling into orchestration and tool sequencing rather than invalid calls. The open release with deterministic evaluators and a public leaderboard makes the setup usable for follow-on work. Agreement metrics reach Krippendorff alpha of 0.74-0.82 on the 30-scenario rotating and aero sample, which is reasonable for this kind of evaluation. The ablation showing MCP execution beating text RAG on battery remaining useful life (100% to 20% pass-all-3) is the part that stands out, but it uses only five scenarios rated by one person while the agreement numbers come from a separate set. That small sample leaves room for scenario selection or rater effects to drive the gap instead of a structural limit of static retrieval. The rest of the results on the larger set appear more stable, and the work avoids circularity by tying everything to external public data and existing algorithms. This is for researchers testing agent reliability on safety-critical maintenance tasks. A reader focused on practical benchmarks will get concrete scenarios and error breakdowns worth examining. It deserves peer review because the benchmark construction is substantive and the data is public, even if the ablation needs more runs to support the architectural conclusion.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces PHMForge, an evaluation environment for LLM agents on industrial Prognostics and Health Management (PHM) tasks. It supplies 99 SME-authored scenarios across eight asset classes (rotating equipment, aero-engines, lithium-ion cells) served by 39 MCP-native tools that wrap published algorithms (C-MAPSS, ISO 10816, Arrhenius models, time-series foundation models). Across three agentic frameworks and six LLM backbones the best configuration reaches 80.8% pass@1, with residual failures concentrated in orchestration and tool sequencing. An architectural ablation shows that replacing MCP execution with text-based RAG over telemetry-equivalent evidence drops Remaining Useful Life pass-all-3 from 100% (5/5) to 20% (1/5) on the battery class. Krippendorff’s α is reported in [0.74, 0.82] for a 30-scenario rotating/aero sample; the battery extension is single-rater. The benchmark, deterministic evaluators, and public leaderboard are open-sourced.
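As background on the agreement statistic cited in the summary, the following is a minimal sketch of Krippendorff's α for nominal labels, computed from the coincidence matrix as α = 1 − D_o / D_e. Whether the paper used the nominal variant or another distance metric is not stated here, so that choice is an assumption, and the example ratings are invented for illustration.

```python
from collections import Counter
from itertools import permutations
from typing import Hashable, Sequence


def krippendorff_alpha_nominal(units: Sequence[Sequence[Hashable]]) -> float:
    """Krippendorff's alpha for nominal data.

    Each element of `units` holds the labels that raters assigned to one item.
    Items with fewer than two labels carry no agreement information and are skipped.
    """
    coincidences: Counter = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue
        # Every ordered pair of distinct rater positions contributes 1 / (m - 1).
        for a, b in permutations(ratings, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals: Counter = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b) / n
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n * (n - 1))
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1.0 - observed / expected


# Invented example: two raters label four scenarios as pass ("p") or fail ("f").
example = [("p", "p"), ("p", "p"), ("f", "f"), ("f", "p")]
alpha = krippendorff_alpha_nominal(example)  # 8/15 for this toy data
```

Values near 1 indicate agreement well above chance, and a single-rater item contributes nothing to the statistic, which is why no α can be reported for the single-rater battery extension.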

Significance. If the central empirical claims hold, the work supplies a reproducible, algorithm-grounded benchmark that isolates protocol fluency from reasoning and tool execution from retrieval. The open release of scenarios, wrappers, evaluators, and leaderboard is a clear strength that enables direct follow-up. The ablation result, if robust, would indicate that static retrieval is structurally insufficient for dynamic prognostic computation, a finding with direct implications for agent design in safety-critical domains.

major comments (1)

[Architectural ablation (results section)] The architectural ablation claiming that MCP execution is required for reliable RUL prediction (pass-all-3 falling from 100% to 20%) rests exclusively on n=5 battery scenarios evaluated by a single rater. Krippendorff’s α is reported only for the 30-scenario rotating/aero sample; no multi-rater or larger-n validation is provided for the battery class. This small sample is load-bearing for the structural-limit conclusion yet leaves open sampling variability, scenario selection effects, or rater-specific judgment as alternative explanations.
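The sampling-variability concern can be quantified with a short calculation. The sketch below computes 95% Wilson score intervals for 5/5 and 1/5 and a two-sided Fisher exact test on the 2×2 table of passes and failures. Treating the five scenarios as independent draws from a larger population is an assumption made for illustration, and neither statistic is reported in the paper.

```python
from math import comb, sqrt
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


mcp_ci = wilson_interval(5, 5)  # MCP execution: 5 of 5 scenarios pass
rag_ci = wilson_interval(1, 5)  # text RAG: 1 of 5 scenarios passes
p_value = fisher_exact_two_sided(5, 0, 1, 4)  # rows: MCP, RAG; columns: pass, fail
```

Under these assumptions the intervals come out at roughly [0.57, 1.00] and [0.04, 0.62], which overlap slightly, and the two-sided p-value is about 0.048. That is consistent with the referee's point that n=5 supports the direction of the effect more firmly than its size.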

minor comments (2)

[Evaluation metrics] Clarify the precise definition and computation of “pass-all-3” versus pass@1, including how partial successes are scored.
[Benchmark construction] Add a brief description of scenario construction, inclusion/exclusion criteria, and any steps taken to mitigate selection bias in the 99 scenarios.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the importance of robust validation for the architectural ablation. We address the single major comment below and will incorporate the suggested strengthening in the revision.

Referee: The architectural ablation claiming that MCP execution is required for reliable RUL prediction (pass-all-3 falling from 100% to 20%) rests exclusively on n=5 battery scenarios evaluated by a single rater. Krippendorff’s α is reported only for the 30-scenario rotating/aero sample; no multi-rater or larger-n validation is provided for the battery class. This small sample is load-bearing for the structural-limit conclusion yet leaves open sampling variability, scenario selection effects, or rater-specific judgment as alternative explanations.

Authors: We agree that the ablation currently rests on only five battery scenarios with single-rater evaluation, while Krippendorff’s α is provided solely for the 30-scenario rotating-equipment/aero-engine sample. Although the pass-all-3 outcomes are produced by deterministic evaluators grounded in published algorithms (thereby limiting outcome subjectivity), we acknowledge that the small n leaves room for sampling variability. In the revised manuscript we will expand the battery ablation to at least 20 scenarios, obtain multi-rater annotations for this class, and report the corresponding Krippendorff’s α to strengthen the structural claim. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark on external datasets and published algorithms

The paper presents an empirical evaluation framework (PHMForge) with 99 SME-authored scenarios and 39 tool wrappers around independently published PHM algorithms (C-MAPSS, ISO 10816, Arrhenius models, etc.). All reported metrics (80.8% pass@1, RUL pass-all-3 rates, ablation results) are direct observational measurements on public datasets; no equations, fitted parameters, or predictions are derived from the paper's own outputs. No self-definitional loops, fitted-input-as-prediction, or load-bearing self-citation chains appear in the derivation chain. The MCP-vs-RAG ablation is a controlled comparison on fixed scenarios, not a reduction to internal construction. This is a standard honest empirical result with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark validity rests on two domain assumptions about scenario representativeness and tool correctness; no free parameters or invented entities are introduced.

axioms (2)

Domain assumption: SME-authored scenarios accurately represent real industrial PHM challenges across asset classes.
Invoked to justify the 99 scenarios as the evaluation substrate.
Domain assumption: The 39 MCP-native tools correctly wrap and execute the published PHM algorithms.
Central to claiming the evaluation measures genuine prognostic reasoning rather than tool artifacts.

pith-pipeline@v0.9.0 · 5678 in / 1289 out tokens · 47591 ms · 2026-05-13T22:00:41.720438+00:00 · methodology

