pith. machine review for the scientific record.

arxiv: 2605.08614 · v1 · submitted 2026-05-09 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:56 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM benchmark · industrial maintenance · symbolic rules · action recommendation · multiple choice evaluation · model brittleness · calibration

The pith

Frontier LLMs translate industrial symbolic rules to actions well until rules are structurally perturbed, exposing calibration as the deployment limit.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds DiagnosticIQ to test whether LLMs can turn engineer-authored symbolic rules about sensor conditions into the right maintenance actions for complex industrial assets. It shows that top models reach similar high scores on the base questions, yet lose 13 to 60 percent relative accuracy when distractors are expanded and still select the original answer 49 to 63 percent of the time when conditions are inverted. Human experts score only 45 percent on average, confirming the questions demand asset-specific specialist knowledge. The authors conclude that raw capability is no longer the issue; reliable use requires better calibration to handle real variations in rule structure.

Core claim

The frontier has closed among top LLMs on template-style rule-to-action tasks, yet every model loses substantial accuracy under structural perturbation and frequently selects the original answer even after condition inversion, revealing pattern matching rather than robust reasoning.

What carries the argument

The symbolic-to-MCQA pipeline that normalizes rules to Disjunctive Normal Form and samples distractors via embeddings to create five probing variants of each question.
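To make the two stages concrete: a minimal sketch, with sympy's to_dnf standing in for the paper's rule normalizer and TF-IDF cosine similarity standing in for its learned sentence embeddings. The rule, the candidate action pool, and the top-k value are invented for illustration, not drawn from the paper.

```python
# Sketch of the two pipeline stages: DNF normalization, then embedding-style
# distractor sampling. TF-IDF is a stand-in for the paper's sentence embeddings.
from sympy import symbols
from sympy.logic.boolalg import to_dnf
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy engineer-authored rule: high discharge temp AND (low flow OR vibration).
hi_temp, low_flow, vibration = symbols("hi_temp low_flow vibration")
rule = hi_temp & (low_flow | vibration)
print(to_dnf(rule, simplify=True))
# -> (hi_temp & low_flow) | (hi_temp & vibration): each clause can seed a question.

# Distractor sampling: rank candidate actions by similarity to the true action
# and keep the top-k nearest as hard distractors (the "selection" strategy).
true_action = "inspect chilled water pump and clear strainer blockage"
candidates = [
    "replace pump strainer and verify chilled water flow",
    "recalibrate discharge air temperature sensor",
    "lubricate supply fan bearings",
    "reset compressor high-pressure cutout",
]
vec = TfidfVectorizer().fit([true_action] + candidates)
sims = cosine_similarity(vec.transform([true_action]), vec.transform(candidates))[0]
top_k = 2  # illustrative; the paper tunes separate top-k values per strategy
distractors = [c for _, c in sorted(zip(sims, candidates), reverse=True)[:top_k]]
print(distractors)
```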

If this is right

  • Top models perform within one Macro point of each other on the base benchmark.
  • All models show 13-60 percent relative accuracy loss under distractor expansion (relative to base accuracy; see the sketch after this list).
  • Frontier models still pick the original answer 49-63 percent of the time after condition inversion.
  • The bottleneck for deployment is calibration for robustness rather than raw capability.
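For concreteness, the relative-loss figures above are presumably computed against each model's base accuracy; a one-function sketch with invented numbers:

```python
# Relative accuracy loss as a fraction of base accuracy. The 0.90 / 0.60
# numbers are invented, not the paper's: such a model would lose ~33%.
def relative_loss(base_acc: float, perturbed_acc: float) -> float:
    return (base_acc - perturbed_acc) / base_acc

print(f"{relative_loss(0.90, 0.60):.0%}")  # -> 33%
```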

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training or prompting methods that explicitly include structural perturbations could close the observed gap.
  • The same pipeline might reveal similar calibration issues in other domains that rely on symbolic rules, such as medical protocols or regulatory compliance.
  • Hybrid systems that pair LLMs with explicit symbolic checkers could reduce reliance on pattern matching.
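A minimal sketch of the last extension, assuming rules can be evaluated directly on a sensor snapshot; the thresholds, snapshot, and action strings below are invented for illustration:

```python
# Hybrid guard: an LLM proposes an action, and a symbolic checker verifies
# that the triggering rule actually fires before the action is accepted.
def rule_fires(readings: dict) -> bool:
    # DNF form: (hi_temp AND low_flow) OR (hi_temp AND vibration).
    hi_temp = readings["discharge_temp_c"] > 55
    low_flow = readings["flow_lpm"] < 10
    vibration = readings["vibration_mm_s"] > 7
    return (hi_temp and low_flow) or (hi_temp and vibration)

def guarded_action(llm_choice: str, readings: dict, rule_action: str) -> str:
    if not rule_fires(readings):
        return "no action: rule condition not met"
    if llm_choice != rule_action:
        return f"escalate: LLM chose '{llm_choice}', rule prescribes '{rule_action}'"
    return llm_choice

snapshot = {"discharge_temp_c": 61.0, "flow_lpm": 8.2, "vibration_mm_s": 3.1}
print(guarded_action("lubricate supply fan bearings", snapshot,
                     "inspect chilled water pump and clear strainer blockage"))
```

The design point is that the symbolic layer, not the LLM, holds veto authority: a pattern-matched answer that contradicts the rule never reaches the technician.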

Load-bearing premise

The generated multiple-choice questions faithfully test the specialist knowledge needed for real maintenance decisions without artifacts from the normalization or distractor choices.

What would settle it

A model that maintains its base accuracy across the Pert and Aug variants with no relative drop would show the claimed brittleness does not hold.

Figures

Figures reproduced from arXiv: 2605.08614 by Christodoulos Constantinides, Deborah L. McGuinness, Devin Yasith De Silva, Dhaval Patel, Jayant Kalagnanam, Nianjun Zhou, Nicolas Constantinides, Paul J Adams, Sal Rosato, Shuxin Lin.

Figure 1: The maintenance pipeline: IoT sensors → rule-based alarms → corrective actions. Industrial assets such as wind turbines, air handling units, and chillers require significant domain expertise to operate, maintain, and tune effectively. They are frequently deployed in operationally critical environments such as healthcare facilities, wind farms [22] and large data centers [23, 26], where reliability and …
Figure 2: End-to-end workflow of the dataset generation pipeline.
Figure 3: Symbolic Conditions to MCQA Pipeline. “Met for 2 Hours” applies to any condition.
Figure 4: DiagnosticIQ Pro composition by asset type (outer ring), option count (inner ring), and question type. We apply the pipeline described in Section 3 on 118 expert-curated rules to construct DiagnosticIQ. We set the hyper-parameters Nsel_topk = 25, Neli_topk = 25, NQT = 10, α = 10, and β = 10. The resulting dataset contains 6690 questions, with composition shown in …
Figure 5: Field-wide leadership on GPQA Diamond [2], DiagnosticIQ, and DiagnosticIQ Pro. Frontier progression has stalled on industrial-maintenance reasoning. As shown in …
Figure 6: Mean feature values for correctly answered, failed, and bottom-10% questions. Difficulty predictors within DiagnosticIQ. We build a logistic regression model to predict per-question correctness for claude-opus-4-6 using four features (…
Figure 7: The five-stage rule construction lifecycle underlying DiagnosticIQ. Reliability Engineers, …
Figure 8: Word count distribution of unique actions in the expert-curated dataset.
Figure 9: Dataset size vs. mean IoU as α and β vary. Bubble size is proportional to question count; higher α, β yields more questions but lower option diversity.
Figure 10: Rule-to-rule similarity heatmap based on …
Figure 11: Expert ratings of model-generated rationales for mistral-large.
Figure 14: Per-asset accuracy for claude-opus-4-6 on DiagnosticIQ and DiagnosticIQ Pro.
Figure 15: An example LLM prompt.
Figure 17: Question difficulty across 5,180 positive-type simpleV questions.
Figure 18: Bradley-Terry Elo ratings with 95% bootstrap confidence intervals. Models are sorted by …
Figure 19: Per-respondent accuracy on 40 DiagnosticIQ questions. Bars are sorted by accuracy …
Figure 20: Human and LLM accuracy on the shared 40-question subset.
Figure 21: Per-rule accuracy distributions for the top-5 (blue) and bottom-5 (orange) models. Red …
Figure 22: Observed vs. binomial-null fraction of rules at …
Figure 23: Effective sample size per model after correcting for within-rule clustering. The red dashed …
Figure 24: Naive question-level (blue) vs. design-effect-corrected (red) …
Figure 25: Cumulative share of correct answers as a function of rule rank (rules sorted by descending …
Original abstract

Monitoring complex industrial assets relies on engineer-authored symbolic rules that trigger based on sensor conditions and prompt technicians to perform corrective actions. The bottleneck is not detection but response: translating rules into maintenance steps requires asset-specific knowledge gained through years of practice. We investigate whether LLMs can serve as decision support for this rule-to-action step and introduce DiagnosticIQ, a benchmark of 6,690 expert-validated multiple-choice questions from 118 rule-action pairs across 16 asset types. We contribute (i) a symbolic-to-MCQA pipeline normalizing rules to Disjunctive Normal Form with embedding-based distractor sampling, (ii) five variants probing distinct failure modes (Pro, Pert, Verbose, Aug, Rationale), and (iii) a benchmark of 29 LLMs and 4 embedding baselines. A human evaluation (9 practitioners, mean 45.0%) confirms DiagnosticIQ requires specialist knowledge beyond operational experience. Three findings stand out. The frontier has closed: the top three LLMs lie within one Macro point, with Bradley-Terry Elo placing claude-opus-4-6 30 points above the next model. Yet DiagnosticIQ Pro exposes brittleness, with every model losing 13-60% relative accuracy under distractor expansion. DiagnosticIQ Aug exposes pattern-matching: under condition inversion, frontier models still select the original answer 49-63% of the time. The deployment bottleneck is not capability but calibration: frontier models handle template-style fault detection but break under structural perturbation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DiagnosticIQ, a benchmark of 6,690 expert-validated MCQs derived from 118 rule-action pairs across 16 industrial asset types. It describes a symbolic-to-MCQA pipeline that normalizes rules to DNF and uses embedding-based distractor sampling, evaluates 29 LLMs across five variants (Pro, Pert, Verbose, Aug, Rationale), and reports that frontier models achieve high accuracy on template-style questions but suffer 13-60% relative accuracy drops and 49-63% inversion persistence under structural perturbations. Human evaluation with 9 practitioners yields 45% mean accuracy, supporting the claim that the deployment bottleneck is calibration rather than capability.

Significance. If the benchmark construction faithfully isolates specialist maintenance knowledge without pipeline artifacts, the work provides a useful domain-specific evaluation framework and dataset that highlights robustness limitations in current LLMs for industrial decision support. The quantitative findings on brittleness and pattern-matching, together with the human baseline, could inform targeted improvements in LLM calibration for safety-critical applications.

major comments (2)
  1. [§3 (Symbolic-to-MCQA Pipeline)] The symbolic-to-MCQA pipeline (DNF normalization plus embedding-based distractor sampling) is load-bearing for the central claim that observed model drops reflect genuine calibration deficits rather than artifacts. DNF normalization can change rule scope or introduce logical equivalences absent from the original engineer-authored rules, and embedding similarity may select distractors based on vector proximity rather than domain plausibility. No concrete examples of original vs. normalized rules or analysis of semantic drift are provided to rule this out; a sketch of one such equivalence check follows this comment list.
  2. [Human Evaluation] The human evaluation (9 practitioners, mean 45.0% accuracy) is used to validate that the questions require specialist knowledge. However, it lacks reported inter-rater reliability, statistical tests for significance, controls for general reasoning load, or comparison against operational-experience baselines, leaving open the possibility that model performance gaps partly reflect pipeline artifacts instead of the intended brittleness.
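One concrete form the drift analysis requested in major comment 1 could take is a truth-functional equivalence check between each rule and its DNF rewrite; a sketch with an invented rule, ignoring temporal qualifiers such as “Met for 2 Hours” that the paper's rules carry:

```python
# Semantic-drift check: a DNF rewrite should be truth-table equivalent to the
# original rule. Invented, purely boolean rule; temporal clauses are ignored.
from sympy import symbols
from sympy.logic.boolalg import to_dnf, Equivalent
from sympy.logic.inference import satisfiable

a, b, c = symbols("a b c")
original = a & (b | c)
normalized = to_dnf(original, simplify=True)

# The two are equivalent iff the negated biconditional is unsatisfiable.
drifted = satisfiable(~Equivalent(original, normalized))
print("semantic drift detected" if drifted else "rewrite preserves meaning")
```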
minor comments (2)
  1. [Results] The abstract and results sections report relative accuracy drops (13-60%) and inversion persistence (49-63%) but do not specify the exact baseline accuracy values or the statistical method used to compute these figures.
  2. [§4 (Benchmark Results)] The Bradley-Terry Elo ranking places claude-opus-4-6 30 points above the next model, but the paper does not report the full ranking table or the number of pairwise comparisons underlying the model.
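As background for the Elo figures discussed in minor comment 2, Bradley-Terry strengths can be fit from pairwise win counts with the standard minorize-maximize update; the 3-model win matrix below is invented and far smaller than the paper's 29-model comparison:

```python
# Bradley-Terry via the standard MM update: s_i <- W_i / sum_j n_ij/(s_i+s_j),
# where W_i is model i's total wins and n_ij the games between i and j.
import numpy as np

wins = np.array([[0, 12, 15],   # wins[i][j] = times model i beat model j
                 [8, 0, 11],
                 [5, 9, 0]], dtype=float)
n_games = wins + wins.T
s = np.ones(len(wins))
total_wins = wins.sum(axis=1)
for _ in range(200):
    denom = (n_games / (s[:, None] + s[None, :])).sum(axis=1)
    s = total_wins / denom
    s /= s.sum()  # strengths are identifiable only up to scale

elo = 400 * np.log10(s)   # Elo difference = 400 * log10(s_i / s_j)
elo -= elo.mean()         # center the scale at zero
print(np.round(elo, 1))
```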

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive suggestions. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3 (Symbolic-to-MCQA Pipeline)] The symbolic-to-MCQA pipeline (DNF normalization plus embedding-based distractor sampling) is load-bearing for the central claim that observed model drops reflect genuine calibration deficits rather than artifacts. DNF normalization can change rule scope or introduce logical equivalences absent from the original engineer-authored rules, and embedding similarity may select distractors based on vector proximity rather than domain plausibility. No concrete examples of original vs. normalized rules or analysis of semantic drift are provided to rule this out.

    Authors: We agree that providing concrete examples and analysis of the pipeline would help address concerns about potential artifacts. In the revised manuscript, we will add an appendix containing side-by-side comparisons of original rules and their DNF-normalized versions for several asset types, along with a discussion of any introduced equivalences or scope changes. We will also include a manual validation by a domain expert on a subset of distractors to assess domain plausibility in addition to embedding similarity. revision: yes

  2. Referee: [Human Evaluation] The human evaluation (9 practitioners, mean 45.0% accuracy) is used to validate that the questions require specialist knowledge. However, it lacks reported inter-rater reliability, statistical tests for significance, controls for general reasoning load, or comparison against operational-experience baselines, leaving open the possibility that model performance gaps partly reflect pipeline artifacts instead of the intended brittleness.

    Authors: We acknowledge these reporting gaps in the human evaluation. We will include inter-rater reliability measures (e.g., Fleiss' kappa) and statistical tests for the reported accuracies in the revision. The original study design prioritized confirming the requirement for specialist knowledge through domain-experienced practitioners rather than including general reasoning controls or operational baselines; we will add a dedicated limitations paragraph discussing this choice and its implications for interpreting the results. The 45% mean accuracy nonetheless indicates the questions are challenging even for experts. revision: partial
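A minimal sketch of the proposed reliability computation, using statsmodels' fleiss_kappa on an invented count matrix (rows are questions, columns are answer options, cells count how many of the 9 raters chose each option):

```python
# Fleiss' kappa over the practitioners' answers; data invented for illustration.
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

counts = np.array([
    [6, 1, 1, 1],   # most raters agree on option A
    [2, 3, 2, 2],   # near-uniform disagreement
    [0, 9, 0, 0],   # unanimous
])
print(f"Fleiss' kappa: {fleiss_kappa(counts):.2f}")
```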

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction and evaluation

full rationale

The paper introduces DiagnosticIQ as an empirical benchmark derived from 118 engineer-authored rule-action pairs across 16 asset types. It describes a symbolic-to-MCQA pipeline (DNF normalization plus embedding-based distractor sampling) and five variants (Pro, Pert, Verbose, Aug, Rationale), then reports performance of 29 LLMs plus human validation (9 practitioners, mean 45%). No equations, fitted parameters, or predictions are defined in terms of the paper's own outputs. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims (frontier models close in score but brittle under perturbation; bottleneck is calibration) rest on direct experimental measurements rather than any reduction to inputs by construction. This is standard benchmark methodology with no self-referential derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on one domain assumption about rule normalization and standard LLM evaluation practices; it introduces no free parameters or invented entities.

axioms (1)
  • domain assumption: Symbolic rules can be normalized to Disjunctive Normal Form without loss of meaning for generating valid multiple-choice action-recommendation questions.
    Invoked in the symbolic-to-MCQA pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5616 in / 1337 out tokens · 75146 ms · 2026-05-12T00:56:52.016281+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 7 internal anchors

  1. [1]

    GEPA: Reflective prompt evolution can outperform reinforcement learning

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alex Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective prompt evolution can outperform reinforcement learning. In The Fourteenth International...

  2. [2]

    GPQA Diamond Benchmark Leaderboard

    Artificial Analysis. GPQA Diamond Benchmark Leaderboard. https://artificialanalysis.ai/evaluations/gpqa-diamond, 2026. Accessed: 2026-05-02

  3. [3]

    Standards and guidelines

    ASHRAE. Standards and guidelines. https://www.ashrae.org/technical-resources/standards-and-guidelines, n.d. Accessed: 2025-11-20

  4. [4]

    Data-driven fault detection and diagnosis for hvac water chillers

    A Beghi, R Brignoli, Luca Cecchinato, Gabriele Menegazzo, Mirco Rampazzo, and F Simmini. Data-driven fault detection and diagnosis for hvac water chillers. Control Engineering Practice, 53:79–91, 2016

  5. [5]

    Interval estimation for a binomial proportion

    Lawrence D. Brown, T. Tony Cai, and Anirban Dasgupta. Interval estimation for a binomial proportion. Statistical Science, 16:101–133, 2001. URL https://api.semanticscholar.org/CorpusID:7039587

  6. [7]

    FinTextQA: A dataset for long-form financial question answering

    Jian Chen, Peilin Zhou, Yining Hua, Loh Xin, Kehui Chen, Ziyuan Li, Bing Zhu, and Junwei Liang. FinTextQA: A dataset for long-form financial question answering. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6025–6047, Bangkok,...

  7. [8]

    Failuresensoriq: A multi-choice qa dataset for understanding sensor relationships and failure modes, 2025

    Christodoulos Constantinides, Dhaval Patel, Shuxin Lin, Claudio Guerrero, Sunil Dagajirao Patil, and Jayant Kalagnanam. Failuresensoriq: A multi-choice qa dataset for understanding sensor relationships and failure modes, 2025. URL https://arxiv.org/abs/2506.03278

  8. [9]

    Anomaly detection for iot time-series data: A survey

    Andrew A Cook, Göksel Mısırlı, and Zhong Fan. Anomaly detection for iot time-series data: A survey. IEEE Internet of Things Journal, 7(7):6481–6494, 2019

  9. [10]

    Intelligent maintenance powered by iot and ai, Apr 2026

    Cytiva. Intelligent maintenance powered by iot and ai, Apr 2026. URL https://www.cytivalifesciences.com/en/us/insights/intelligent-equipment-maintenance. Accessed: 2026-05-02

  10. [11]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms, 2023. URL https://arxiv.org/abs/2305.14314

  11. [12]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  12. [13]

    Intelligent predictive maintenance RAG framework for power plants: Enhancing QA with StyleDFS and domain specific instruction tuning

    Seongtae Hong, Joong Min Shin, Jaehyung Seo, Taemin Lee, Jeongbae Park, Cho Man Young, Byeongho Choi, and Heuiseok Lim. Intelligent predictive maintenance RAG framework for power plants: Enhancing QA with StyleDFS and domain specific instruction tuning. In Franck Dernoncourt, Daniel Preoţiuc-Pietro, and Anastasia Shimorina, editors, Proceedings of the 20...

  13. [14]

    Machine learning for predictive maintenance of industrial machines using iot sensor data

    Ameeth Kanawaday and Aditya Sane. Machine learning for predictive maintenance of industrial machines using iot sensor data. In 2017 8th IEEE international conference on software engineering and service science (ICSESS), pages 87–90. IEEE, 2017

  14. [15]

    Time-mqa: Time series multi-task question answering with context enhancement

    Yaxuan Kong, Yiyuan Yang, Yoontae Hwang, Wenjie Du, Stefan Zohren, Zhangyang Wang, Ming Jin, and Qingsong Wen. Time-mqa: Time series multi-task question answering with context enhancement. In Annual Meeting of the Association for Computational Linguistics,

  15. [16]

    URL https://api.semanticscholar.org/CorpusID:276774750

  16. [17]

    TelBench: A benchmark for evaluating telco-specific large language models

    Sunwoo Lee, Dhammiko Arya, Seung-Mo Cho, Gyoung-eun Han, Seokyoung Hong, Wonbeom Jang, Seojin Lee, Sohee Park, Sereimony Sek, Injee Song, Sungbin Yoon, and Eric Davis. TelBench: A benchmark for evaluating telco-specific large language models. In Franck Dernoncourt, Daniel Preoţiuc-Pietro, and Anastasia Shimorina, editors, Proceedings of the 2024 Confer...

  17. [18]

    Perteval: Unveiling real knowledge capacity of llms with knowledge-invariant perturbations,

    Jiatong Li, Renjun Hu, Kunzhe Huang, Yan Zhuang, Qi Liu, Mengxiao Zhu, Xing Shi, and Wei Lin. Perteval: Unveiling real knowledge capacity of llms with knowledge-invariant perturbations,

  18. [19]

    URL https://arxiv.org/abs/2405.19740

  19. [20]

    Active multi-mode data analysis to improve fault diagnosis in ahus

    Guanjing Lin, John House, Yimin Chen, Jessica Granderson, and Wanpeng Zhang. Active multi-mode data analysis to improve fault diagnosis in ahus. Energy and Buildings, 337:115621, 2025. ISSN 0378-7788. doi: https://doi.org/10.1016/j.enbuild.2025.115621. URL https://www.sciencedirect.com/science/article/pii/S0378778825003512

  20. [21]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022. URL https://arxiv.org/abs/2109.07958

  21. [22]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025

  22. [23]

    A Comparative Benchmark of Large Language Models for Labelling Wind Turbine Maintenance Logs

    Max Malyi, Jonathan Shek, Alasdair McDonald, and Andre Biscaya. A comparative benchmark of large language models for labelling wind turbine maintenance logs, 2025. URL https://arxiv.org/abs/2509.06813

  23. [24]

    WFCRL: A multi-agent reinforcement learning benchmark for wind farm control

    Claire Bizon Monroc, Ana Busic, Donatien Dubuc, and Jiamin Zhu. WFCRL: A multi-agent reinforcement learning benchmark for wind farm control. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=ZRMAhpZ3ED

  24. [25]

    LC-opt: Benchmarking reinforcement learning and agentic AI for end-to-end liquid cooling optimization in data centers

    Avisek Naug, Antonio Guillen-Perez, Vineet Kumar, Scott Greenwood, Wesley Brewer, Sahand Ghorbanpour, Ashwin Ramesh Babu, Vineet Gundecha, Ricardo Luna Gutierrez, and Soumyendu Sarkar. LC-opt: Benchmarking reinforcement learning and agentic AI for end-to-end liquid cooling optimization in data centers. In The Thirty-ninth Annual Conference on Neural Inf...

  25. [26]

    Add failure diagnostics information to asset incidents and anomalies, 2025

    Oracle. Add failure diagnostics information to asset incidents and anomalies, 2025. Oracle IoT Asset Monitoring Cloud Service

  26. [27]

    Prescriptive maintenance explained: Beyond predictive cmms, Apr 2026

    OxMaint. Prescriptive maintenance explained: Beyond predictive cmms, Apr 2026. URL https://oxmaint.com/article/prescriptive-maintenance-cmms-guide. Accessed: 2026-05-02

  27. [28]

    Assetopsbench: Benchmarking ai agents for task automation in industrial asset operations and maintenance, 2025

    Dhaval Patel, Shuxin Lin, James Rayfield, Nianjun Zhou, Chathurangi Shyalika, Suryanarayana R Yarrabothula, Roman Vaculin, Natalia Martinez, Fearghal O’donncha, and Jayant Kalagnanam. Assetopsbench: Benchmarking ai agents for task automation in industrial asset operations and maintenance. arXiv preprint arXiv:2506.03828, 2025

  28. [29]

    Humanity's Last Exam

    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025

  29. [30]

    ERVQA: A dataset to benchmark the readiness of large vision language models in hospital environments

    Sourjyadip Ray, Kushal Gupta, Soumi Kundu, Dr Payal Arvind Kasat, Somak Aditya, and Pawan Goyal. ERVQA: A dataset to benchmark the readiness of large vision language models in hospital environments. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1...

  30. [31]

    M. Raza, Z. Jahangir, M. B. Riaz, et al. Industrial applications of large language models. Scientific Reports, 15:13755, 2025. doi: 10.1038/s41598-025-98483-1. URL https://doi.org/10.1038/s41598-025-98483-1

  31. [32]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. ArXiv, abs/1908.10084, 2019. URL https://api.semanticscholar.org/CorpusID:201646309

  32. [33]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98

  33. [34]

    Leveraging large language models for multiple choice question answering, 2023

    Joshua Robinson, Christopher Michael Rytting, and David Wingate. Leveraging large language models for multiple choice question answering, 2023. URL https://arxiv.org/abs/2210.12353

  34. [35]

    ClimRetrieve: A benchmarking dataset for information retrieval from corporate climate disclosures

    Tobias Schimanski, Jingwei Ni, Roberto Spacey Martín, Nicola Ranger, and Markus Leippold. ClimRetrieve: A benchmarking dataset for information retrieval from corporate climate disclosures. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17509–175...

  35. [36]

    Anomaly detection in iiot: A case study using machine learning

    Gauri Shah and Aashis Tiwari. Anomaly detection in iiot: A case study using machine learning. In Proceedings of the ACM India joint international conference on data science and management of data, pages 295–300, 2018

  36. [37]

    SkySpark Analytics Platform

    SkyFoundry, LLC. SkySpark Analytics Platform. https://skyfoundry.com/product, 2026. Accessed: 2026-05-06

  37. [38]

    Large language models for forecasting and anomaly detection: A systematic literature review, 2024

    Jing Su, Chufeng Jiang, Xin Jin, Yuxin Qiao, Tingsong Xiao, Hongda Ma, Rong Wei, Zhi Jing, Jiajun Xu, and Junhong Lin. Large language models for forecasting and anomaly detection: A systematic literature review, 2024. URL https://arxiv.org/abs/2402.10350

  38. [39]

    CasiMedicos-arg: A medical question answering dataset annotated with explanatory argumentative structures

    Ekaterina Sviridova, Anar Yeginbergen, Ainara Estarrona, Elena Cabrio, Serena Villata, and Rodrigo Agerri. CasiMedicos-arg: A medical question answering dataset annotated with explanatory argumentative structures. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Pr...

  39. [40]

    Itformer: Bridging time series and natural language for multi-modal qa with large-scale multitask dataset

    Yilin Wang, Peixuan Lei, Jie Song, Yuzhe Hao, Tao Chen, Yuxuan Zhang, Lei Jia, Yuanxiang Li, and Zhongyu Wei. Itformer: Bridging time series and natural language for multi-modal qa with large-scale multitask dataset. In 42nd International Conference on Machine Learning, volume abs/2506.20093, 2025. URL https://api.semanticscholar.org/CorpusID:280000242

  40. [41]

    MMLU-pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-pro: A more robust and challenging multi-task language understanding benchmark. In The Thirty-eighth Conference on Neural Information Processin...

  41. [42]

    Benchmarking complex instruction-following with multiple constraints composition

    Bosi Wen, Pei Ke, Xiaotao Gu, Lindong Wu, Hao Huang, Jinfeng Zhou, Wenchuang Li, Binxin Hu, Wendy Gao, Jiaxing Xu, Yiming Liu, Jie Tang, Hongning Wang, and Minlie Huang. Benchmarking complex instruction-following with multiple constraints composition. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 202...

  42. [43]

    Smarter Planet

    Wikipedia contributors. Smarter Planet. https://en.wikipedia.org/wiki/Smarter_Planet, 2026. Accessed: 2026-05-06

  43. [44]

    Phm-bench: A domain-specific benchmarking framework for systematic evaluation of large models in prognostics and health management

    Puyu Yang, Laifa Tao, Zijian Huang, Haifei Liu, Wenyan Cao, Hao Ji, Jianan Qiu, Qixuan Huang, Xuanyuan Su, Yuhang Xie, Jun Zhang, Shangyu Li, Chen Lu, and Zhixuan Lian. Phm-bench: A domain-specific benchmarking framework for systematic evaluation of large models in prognostics and health management, 2025. URL https://arxiv.org/abs/2508.02490

  44. [45]

    Benchmarking llms via uncertainty quantification

    Fanghua Ye, Mingming Yang, Jianhui Pang, Longyue Wang, Derek Wong, Emine Yilmaz, Shuming Shi, and Zhaopeng Tu. Benchmarking llms via uncertainty quantification. Advances in Neural Information Processing Systems, 37:15356–15385, 2024

  45. [46]

    Camb: A comprehensive industrial llm benchmark on civil aviation maintenance, 2025

    Feng Zhang, Chengjie Pang, Yuehan Zhang, and Chenyu Luo. Camb: A comprehensive industrial llm benchmark on civil aviation maintenance, 2025. URL https://arxiv.org/abs/2508.20420

  46. [47]

    RAG4ITOps: A supervised fine-tunable and comprehensive RAG framework for IT operations and maintenance

    Tianyang Zhang, Zhuoxuan Jiang, Shengguang Bai, Tianrui Zhang, Lin Lin, Yang Liu, and Jiawei Ren. RAG4ITOps: A supervised fine-tunable and comprehensive RAG framework for IT operations and maintenance. In Franck Dernoncourt, Daniel Preoţiuc-Pietro, and Anastasia Shimorina, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Langua...

  47. [48]

    RuAG: Learned-rule-augmented generation for large language models

    Yudi Zhang, Pei Xiao, Lu Wang, Chaoyun Zhang, Meng Fang, Yali Du, Yevgeniy Puzyrev, Randolph Yao, Si Qin, Qingwei Lin, Mykola Pechenizkiy, Dongmei Zhang, Saravan Rajmohan, and Qi Zhang. RuAG: Learned-rule-augmented generation for large language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.ne...

  48. [49]

    Multiple-choice questions are efficient and robust llm evaluators, 2024

    Ziyin Zhang, Zhaokun Jiang, Lizhen Xu, Hongkun Hao, and Rui Wang. Multiple-choice questions are efficient and robust llm evaluators, 2024. URL https://arxiv.org/abs/2405.11966

  49. [50]

    Large language models are not robust multiple choice selectors

    Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors, 2024. URL https://arxiv.org/abs/2309.03882

  50. [51]

    Natural language processing approaches in industrial maintenance: A systematic literature review

    Keyi Zhong, Tom Jackson, Andrew West, and Georgina Cosma. Natural language processing approaches in industrial maintenance: A systematic literature review. Procedia Computer Science, 232:2082–2097, 2024. ISSN 1877-0509. doi: https://doi.org/10.1016/j.procs.2024.02.029. URL https://www.sciencedirect.com/science/article/pii/S1877050924002060. 5th Internati...
