pith. machine review for the scientific record.

arxiv: 2605.08614 · v1 · submitted 2026-05-09 · 💻 cs.AI

Recognition: 2 theorem links · Lean Theorem

DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 00:56 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM benchmark · industrial maintenance · symbolic rules · action recommendation · multiple choice evaluation · model brittleness · calibration

The pith

Frontier LLMs translate industrial symbolic rules to actions well until rules are structurally perturbed, exposing calibration as the deployment limit.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds DiagnosticIQ to test whether LLMs can turn engineer-authored symbolic rules about sensor conditions into the right maintenance actions for complex industrial assets. It shows that top models reach similar high scores on the base questions, yet lose 13 to 60 percent relative accuracy when distractors are expanded and still select the original answer 49 to 63 percent of the time when conditions are inverted. Human experts score only 45 percent on average, confirming the questions demand asset-specific specialist knowledge. The authors conclude that raw capability is no longer the issue; reliable use requires better calibration to handle real variations in rule structure.

Core claim

The frontier has closed among top LLMs on template-style rule-to-action tasks, yet every model loses substantial accuracy under structural perturbation and frequently selects the original answer even after condition inversion, revealing pattern matching rather than robust reasoning.

What carries the argument

The symbolic-to-MCQA pipeline that normalizes rules to Disjunctive Normal Form and samples distractors via embeddings to create five probing variants of each question.
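To make the two stages concrete: a minimal sketch, with sympy's to_dnf standing in for the paper's rule normalizer and TF-IDF cosine similarity standing in for its learned sentence embeddings. The rule, the candidate action pool, and the top-k value are invented for illustration, not drawn from the paper.

```python
# Sketch of the two pipeline stages: DNF normalization, then embedding-style
# distractor sampling. TF-IDF is a stand-in for the paper's sentence embeddings.
from sympy import symbols
from sympy.logic.boolalg import to_dnf
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy engineer-authored rule: high discharge temp AND (low flow OR vibration).
hi_temp, low_flow, vibration = symbols("hi_temp low_flow vibration")
rule = hi_temp & (low_flow | vibration)
print(to_dnf(rule, simplify=True))
# -> (hi_temp & low_flow) | (hi_temp & vibration): each clause can seed a question.

# Distractor sampling: rank candidate actions by similarity to the true action
# and keep the top-k nearest as hard distractors (the "selection" strategy).
true_action = "inspect chilled water pump and clear strainer blockage"
candidates = [
    "replace pump strainer and verify chilled water flow",
    "recalibrate discharge air temperature sensor",
    "lubricate supply fan bearings",
    "reset compressor high-pressure cutout",
]
vec = TfidfVectorizer().fit([true_action] + candidates)
sims = cosine_similarity(vec.transform([true_action]), vec.transform(candidates))[0]
top_k = 2  # illustrative; the paper tunes separate top-k values per strategy
distractors = [c for _, c in sorted(zip(sims, candidates), reverse=True)[:top_k]]
print(distractors)
```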

If this is right

  • Top models perform within one Macro point of each other on the base benchmark.
  • All models show 13-60 percent relative accuracy loss under distractor expansion (relative to base accuracy; see the sketch after this list).
  • Frontier models still pick the original answer 49-63 percent of the time after condition inversion.
  • The bottleneck for deployment is calibration for robustness rather than raw capability.
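For concreteness, the relative-loss figures above are presumably computed against each model's base accuracy; a one-function sketch with invented numbers:

```python
# Relative accuracy loss as a fraction of base accuracy. The 0.90 / 0.60
# numbers are invented, not the paper's: such a model would lose ~33%.
def relative_loss(base_acc: float, perturbed_acc: float) -> float:
    return (base_acc - perturbed_acc) / base_acc

print(f"{relative_loss(0.90, 0.60):.0%}")  # -> 33%
```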

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Training or prompting methods that explicitly include structural perturbations could close the observed gap.
  • The same pipeline might reveal similar calibration issues in other domains that rely on symbolic rules, such as medical protocols or regulatory compliance.
  • Hybrid systems that pair LLMs with explicit symbolic checkers could reduce reliance on pattern matching.
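A minimal sketch of the last extension, assuming rules can be evaluated directly on a sensor snapshot; the thresholds, snapshot, and action strings below are invented for illustration:

```python
# Hybrid guard: an LLM proposes an action, and a symbolic checker verifies
# that the triggering rule actually fires before the action is accepted.
def rule_fires(readings: dict) -> bool:
    # DNF form: (hi_temp AND low_flow) OR (hi_temp AND vibration).
    hi_temp = readings["discharge_temp_c"] > 55
    low_flow = readings["flow_lpm"] < 10
    vibration = readings["vibration_mm_s"] > 7
    return (hi_temp and low_flow) or (hi_temp and vibration)

def guarded_action(llm_choice: str, readings: dict, rule_action: str) -> str:
    if not rule_fires(readings):
        return "no action: rule condition not met"
    if llm_choice != rule_action:
        return f"escalate: LLM chose '{llm_choice}', rule prescribes '{rule_action}'"
    return llm_choice

snapshot = {"discharge_temp_c": 61.0, "flow_lpm": 8.2, "vibration_mm_s": 3.1}
print(guarded_action("lubricate supply fan bearings", snapshot,
                     "inspect chilled water pump and clear strainer blockage"))
```

The design point is that the symbolic layer, not the LLM, holds veto authority: a pattern-matched answer that contradicts the rule never reaches the technician.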

Load-bearing premise

The generated multiple-choice questions faithfully test the specialist knowledge needed for real maintenance decisions without artifacts from the normalization or distractor choices.

What would settle it

A model that maintains its base accuracy across the Pert and Aug variants with no relative drop would show the claimed brittleness does not hold.

Figures

Figures reproduced from arXiv: 2605.08614 by Christodoulos Constantinides, Deborah L. McGuinness, Devin Yasith De Silva, Dhaval Patel, Jayant Kalagnanam, Nianjun Zhou, Nicolas Constantinides, Paul J Adams, Sal Rosato, Shuxin Lin.

Figure 1: The maintenance pipeline: IoT sensors → rule-based alarms → corrective actions. Industrial assets such as wind turbines, air handling units, and chillers require significant domain expertise to operate, maintain, and tune effectively. They are frequently deployed in operationally critical environments such as healthcare facilities, wind farms [22] and large data centers [23, 26], where reliability and …
Figure 2: End-to-end workflow of the dataset generation pipeline.
Figure 3: Symbolic Conditions to MCQA Pipeline. “Met for 2 Hours” applies to any condition.
Figure 4: DiagnosticIQ Pro composition by asset type (outer ring), option count (inner ring), and question type. We apply the pipeline described in Section 3 on 118 expert-curated rules to construct DiagnosticIQ. We set the hyper-parameters Nsel_topk = 25, Neli_topk = 25, NQT = 10, α = 10, and β = 10. The resulting dataset contains 6690 questions, with composition shown in …
Figure 5: Field-wide leadership on GPQA Diamond [2], DiagnosticIQ, and DiagnosticIQ Pro. Frontier progression has stalled on industrial-maintenance reasoning. As shown in …
Figure 6: Mean feature values for correctly answered, failed, and bottom-10% questions. Difficulty predictors within DiagnosticIQ. We build a logistic regression model to predict per-question correctness for claude-opus-4-6 using four features (…
Figure 7: The five-stage rule construction lifecycle underlying DiagnosticIQ. Reliability Engineers, …
Figure 8: Word count distribution of unique actions in the expert-curated dataset.
Figure 9: Dataset size vs. mean IoU as α and β vary. Bubble size is proportional to question count; higher α, β yields more questions but lower option diversity.
Figure 10: Rule-to-rule similarity heatmap based on …
Figure 11: Expert ratings of model-generated rationales for mistral-large.
Figure 14: Per-asset accuracy for claude-opus-4-6 on DiagnosticIQ and DiagnosticIQ Pro.
Figure 15: An example LLM prompt.
Figure 17: Question difficulty across 5,180 positive-type simpleV questions.
Figure 18: Bradley-Terry Elo ratings with 95% bootstrap confidence intervals. Models are sorted by …
Figure 19: Per-respondent accuracy on 40 DiagnosticIQ questions. Bars are sorted by accuracy …
Figure 20: Human and LLM accuracy on the shared 40-question subset.
Figure 21: Per-rule accuracy distributions for the top-5 (blue) and bottom-5 (orange) models. Red …
Figure 22: Observed vs. binomial-null fraction of rules at …
Figure 23: Effective sample size per model after correcting for within-rule clustering. The red dashed …
Figure 24: Naive question-level (blue) vs. design-effect-corrected (red) …
Figure 25: Cumulative share of correct answers as a function of rule rank (rules sorted by descending …
Original abstract

Monitoring complex industrial assets relies on engineer-authored symbolic rules that trigger based on sensor conditions and prompt technicians to perform corrective actions. The bottleneck is not detection but response: translating rules into maintenance steps requires asset-specific knowledge gained through years of practice. We investigate whether LLMs can serve as decision support for this rule-to-action step and introduce DiagnosticIQ, a benchmark of 6,690 expert-validated multiple-choice questions from 118 rule-action pairs across 16 asset types. We contribute (i) a symbolic-to-MCQA pipeline normalizing rules to Disjunctive Normal Form with embedding-based distractor sampling, (ii) five variants probing distinct failure modes (Pro, Pert, Verbose, Aug, Rationale), and (iii) a benchmark of 29 LLMs and 4 embedding baselines. A human evaluation (9 practitioners, mean 45.0%) confirms DiagnosticIQ requires specialist knowledge beyond operational experience. Three findings stand out. The frontier has closed: the top three LLMs lie within one Macro point, with Bradley-Terry Elo placing claude-opus-4-6 30 points above the next model. Yet DiagnosticIQ Pro exposes brittleness, with every model losing 13-60% relative accuracy under distractor expansion. DiagnosticIQ Aug exposes pattern-matching: under condition inversion, frontier models still select the original answer 49-63% of the time. The deployment bottleneck is not capability but calibration: frontier models handle template-style fault detection but break under structural perturbation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DiagnosticIQ, a benchmark of 6,690 expert-validated MCQs derived from 118 rule-action pairs across 16 industrial asset types. It describes a symbolic-to-MCQA pipeline that normalizes rules to DNF and uses embedding-based distractor sampling, evaluates 29 LLMs across five variants (Pro, Pert, Verbose, Aug, Rationale), and reports that frontier models achieve high accuracy on template-style questions but suffer 13-60% relative accuracy drops and 49-63% inversion persistence under structural perturbations. Human evaluation with 9 practitioners yields 45% mean accuracy, supporting the claim that the deployment bottleneck is calibration rather than capability.

Significance. If the benchmark construction faithfully isolates specialist maintenance knowledge without pipeline artifacts, the work provides a useful domain-specific evaluation framework and dataset that highlights robustness limitations in current LLMs for industrial decision support. The quantitative findings on brittleness and pattern-matching, together with the human baseline, could inform targeted improvements in LLM calibration for safety-critical applications.

major comments (2)
  1. [§3 (Symbolic-to-MCQA Pipeline)] The symbolic-to-MCQA pipeline (DNF normalization plus embedding-based distractor sampling) is load-bearing for the central claim that observed model drops reflect genuine calibration deficits rather than artifacts. DNF normalization can change rule scope or introduce logical equivalences absent from the original engineer-authored rules, and embedding similarity may select distractors based on vector proximity rather than domain plausibility. No concrete examples of original vs. normalized rules or analysis of semantic drift are provided to rule this out; a sketch of one such equivalence check follows this comment list.
  2. [Human Evaluation] The human evaluation (9 practitioners, mean 45.0% accuracy) is used to validate that the questions require specialist knowledge. However, it lacks reported inter-rater reliability, statistical tests for significance, controls for general reasoning load, or comparison against operational-experience baselines, leaving open the possibility that model performance gaps partly reflect pipeline artifacts instead of the intended brittleness.
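One concrete form the drift analysis requested in major comment 1 could take is a truth-functional equivalence check between each rule and its DNF rewrite; a sketch with an invented rule, ignoring temporal qualifiers such as “Met for 2 Hours” that the paper's rules carry:

```python
# Semantic-drift check: a DNF rewrite should be truth-table equivalent to the
# original rule. Invented, purely boolean rule; temporal clauses are ignored.
from sympy import symbols
from sympy.logic.boolalg import to_dnf, Equivalent
from sympy.logic.inference import satisfiable

a, b, c = symbols("a b c")
original = a & (b | c)
normalized = to_dnf(original, simplify=True)

# The two are equivalent iff the negated biconditional is unsatisfiable.
drifted = satisfiable(~Equivalent(original, normalized))
print("semantic drift detected" if drifted else "rewrite preserves meaning")
```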
minor comments (2)
  1. [Results] The abstract and results sections report relative accuracy drops (13-60%) and inversion persistence (49-63%) but do not specify the exact baseline accuracy values or the statistical method used to compute these figures.
  2. [§4 (Benchmark Results)] The Bradley-Terry Elo ranking places claude-opus-4-6 30 points above the next model, but the paper does not report the full ranking table or the number of pairwise comparisons underlying the model.
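As background for the Elo figures discussed in minor comment 2, Bradley-Terry strengths can be fit from pairwise win counts with the standard minorize-maximize update; the 3-model win matrix below is invented and far smaller than the paper's 29-model comparison:

```python
# Bradley-Terry via the standard MM update: s_i <- W_i / sum_j n_ij/(s_i+s_j),
# where W_i is model i's total wins and n_ij the games between i and j.
import numpy as np

wins = np.array([[0, 12, 15],   # wins[i][j] = times model i beat model j
                 [8, 0, 11],
                 [5, 9, 0]], dtype=float)
n_games = wins + wins.T
s = np.ones(len(wins))
total_wins = wins.sum(axis=1)
for _ in range(200):
    denom = (n_games / (s[:, None] + s[None, :])).sum(axis=1)
    s = total_wins / denom
    s /= s.sum()  # strengths are identifiable only up to scale

elo = 400 * np.log10(s)   # Elo difference = 400 * log10(s_i / s_j)
elo -= elo.mean()         # center the scale at zero
print(np.round(elo, 1))
```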

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive suggestions. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3 (Symbolic-to-MCQA Pipeline)] The symbolic-to-MCQA pipeline (DNF normalization plus embedding-based distractor sampling) is load-bearing for the central claim that observed model drops reflect genuine calibration deficits rather than artifacts. DNF normalization can change rule scope or introduce logical equivalences absent from the original engineer-authored rules, and embedding similarity may select distractors based on vector proximity rather than domain plausibility. No concrete examples of original vs. normalized rules or analysis of semantic drift are provided to rule this out.

    Authors: We agree that providing concrete examples and analysis of the pipeline would help address concerns about potential artifacts. In the revised manuscript, we will add an appendix containing side-by-side comparisons of original rules and their DNF-normalized versions for several asset types, along with a discussion of any introduced equivalences or scope changes. We will also include a manual validation by a domain expert on a subset of distractors to assess domain plausibility in addition to embedding similarity. revision: yes

  2. Referee: [Human Evaluation] The human evaluation (9 practitioners, mean 45.0% accuracy) is used to validate that the questions require specialist knowledge. However, it lacks reported inter-rater reliability, statistical tests for significance, controls for general reasoning load, or comparison against operational-experience baselines, leaving open the possibility that model performance gaps partly reflect pipeline artifacts instead of the intended brittleness.

    Authors: We acknowledge these reporting gaps in the human evaluation. We will include inter-rater reliability measures (e.g., Fleiss' kappa) and statistical tests for the reported accuracies in the revision. The original study design prioritized confirming the requirement for specialist knowledge through domain-experienced practitioners rather than including general reasoning controls or operational baselines; we will add a dedicated limitations paragraph discussing this choice and its implications for interpreting the results. The 45% mean accuracy nonetheless indicates the questions are challenging even for experts. revision: partial
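A minimal sketch of the proposed reliability computation, using statsmodels' fleiss_kappa on an invented count matrix (rows are questions, columns are answer options, cells count how many of the 9 raters chose each option):

```python
# Fleiss' kappa over the practitioners' answers; data invented for illustration.
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

counts = np.array([
    [6, 1, 1, 1],   # most raters agree on option A
    [2, 3, 2, 2],   # near-uniform disagreement
    [0, 9, 0, 0],   # unanimous
])
print(f"Fleiss' kappa: {fleiss_kappa(counts):.2f}")
```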

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark construction and evaluation

full rationale

The paper introduces DiagnosticIQ as an empirical benchmark derived from 118 engineer-authored rule-action pairs across 16 asset types. It describes a symbolic-to-MCQA pipeline (DNF normalization plus embedding-based distractor sampling) and five variants (Pro, Pert, Verbose, Aug, Rationale), then reports performance of 29 LLMs plus human validation (9 practitioners, mean 45%). No equations, fitted parameters, or predictions are defined in terms of the paper's own outputs. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims (frontier models close in score but brittle under perturbation; bottleneck is calibration) rest on direct experimental measurements rather than any reduction to inputs by construction. This is standard benchmark methodology with no self-referential derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on one domain assumption about rule normalization and standard LLM evaluation practices; it introduces no free parameters or invented entities.

axioms (1)
  • domain assumption: Symbolic rules can be normalized to Disjunctive Normal Form without loss of meaning for generating valid multiple-choice action-recommendation questions.
    Invoked in the symbolic-to-MCQA pipeline described in the abstract.

pith-pipeline@v0.9.0 · 5616 in / 1337 out tokens · 75146 ms · 2026-05-12T00:56:52.016281+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 7 internal anchors

  1. [1]

    GEPA: Reflective prompt evolution can outperform reinforcement learning

    Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alex Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, and Omar Khattab. GEPA: Reflective prompt evolution can outperform reinforcement learning. In The Fourteenth International...

  2. [2]

    GPQA Diamond Benchmark Leaderboard

    Artificial Analysis. GPQA Diamond Benchmark Leaderboard. https://artificialanalysis.ai/evaluations/gpqa-diamond, 2026. Accessed: 2026-05-02

  3. [3]

    Standards and guidelines

    ASHRAE. Standards and guidelines. https://www.ashrae.org/technical-resources/standards-and-guidelines, n.d. Accessed: 2025-11-20

  4. [4]

    Data-driven fault detection and diagnosis for hvac water chillers

    A Beghi, R Brignoli, Luca Cecchinato, Gabriele Menegazzo, Mirco Rampazzo, and F Simmini. Data-driven fault detection and diagnosis for hvac water chillers. Control Engineering Practice, 53:79–91, 2016

  5. [5]

    Interval estimation for a binomial proportion

    Lawrence D. Brown, T. Tony Cai, and Anirban Dasgupta. Interval estimation for a binomial proportion. Statistical Science, 16:101–133, 2001. URL https://api.semanticscholar.org/CorpusID:7039587

  6. [7]

    FinTextQA: A dataset for long-form financial question answering

    Jian Chen, Peilin Zhou, Yining Hua, Loh Xin, Kehui Chen, Ziyuan Li, Bing Zhu, and Junwei Liang. FinTextQA: A dataset for long-form financial question answering. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6025–6047, Bangkok,...

  7. [8]

    Failuresensoriq: A multi-choice qa dataset for understanding sensor relationships and failure modes, 2025

    Christodoulos Constantinides, Dhaval Patel, Shuxin Lin, Claudio Guerrero, Sunil Dagajirao Patil, and Jayant Kalagnanam. Failuresensoriq: A multi-choice qa dataset for understanding sensor relationships and failure modes, 2025. URL https://arxiv.org/abs/2506.03278

  8. [9]

    Anomaly detection for iot time-series data: A survey

    Andrew A Cook, Göksel Mısırlı, and Zhong Fan. Anomaly detection for iot time-series data: A survey. IEEE Internet of Things Journal, 7(7):6481–6494, 2019

  9. [10]

    Intelligent maintenance powered by iot and ai, Apr 2026

    Cytiva. Intelligent maintenance powered by iot and ai, Apr 2026. URL https://www.cytivalifesciences.com/en/us/insights/intelligent-equipment-maintenance. Accessed: 2026-05-02

  10. [11]

    QLoRA: Efficient Finetuning of Quantized LLMs

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms, 2023. URL https://arxiv.org/abs/2305.14314

  11. [12]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  12. [13]

    Intelligent predictive maintenance RAG framework for power plants: Enhancing QA with StyleDFS and domain specific instruction tuning

    Seongtae Hong, Joong Min Shin, Jaehyung Seo, Taemin Lee, Jeongbae Park, Cho Man Young, Byeongho Choi, and Heuiseok Lim. Intelligent predictive maintenance RAG framework for power plants: Enhancing QA with StyleDFS and domain specific instruction tuning. In Franck Dernoncourt, Daniel Preoţiuc-Pietro, and Anastasia Shimorina, editors, Proceedings of the 20...

  13. [14]

    Machine learning for predictive maintenance of industrial machines using iot sensor data

    Ameeth Kanawaday and Aditya Sane. Machine learning for predictive maintenance of industrial machines using iot sensor data. In 2017 8th IEEE international conference on software engineering and service science (ICSESS), pages 87–90. IEEE, 2017

  14. [15]

    Time-mqa: Time series multi-task question answering with context enhancement

    Yaxuan Kong, Yiyuan Yang, Yoontae Hwang, Wenjie Du, Stefan Zohren, Zhangyang Wang, Ming Jin, and Qingsong Wen. Time-mqa: Time series multi-task question answering with context enhancement. In Annual Meeting of the Association for Computational Linguistics,

  15. [16]

    URL https://api.semanticscholar.org/CorpusID:276774750

  16. [17]

    TelBench: A benchmark for evaluating telco-specific large language models

    Sunwoo Lee, Dhammiko Arya, Seung-Mo Cho, Gyoung-eun Han, Seokyoung Hong, Wonbeom Jang, Seojin Lee, Sohee Park, Sereimony Sek, Injee Song, Sungbin Yoon, and Eric Davis. TelBench: A benchmark for evaluating telco-specific large language models. In Franck Dernoncourt, Daniel Preoţiuc-Pietro, and Anastasia Shimorina, editors, Proceedings of the 2024 Confer...

  17. [18]

    Perteval: Unveiling real knowledge capacity of llms with knowledge-invariant perturbations,

    Jiatong Li, Renjun Hu, Kunzhe Huang, Yan Zhuang, Qi Liu, Mengxiao Zhu, Xing Shi, and Wei Lin. Perteval: Unveiling real knowledge capacity of llms with knowledge-invariant perturbations,

  18. [19]

    URL https://arxiv.org/abs/2405.19740

  19. [20]

    Active multi-mode data analysis to improve fault diagnosis in ahus

    Guanjing Lin, John House, Yimin Chen, Jessica Granderson, and Wanpeng Zhang. Active multi-mode data analysis to improve fault diagnosis in ahus. Energy and Buildings, 337:115621, 2025. ISSN 0378-7788. doi: https://doi.org/10.1016/j.enbuild.2025.115621. URL https://www.sciencedirect.com/science/article/pii/S0378778825003512

  20. [21]

    TruthfulQA: Measuring How Models Mimic Human Falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods, 2022. URL https://arxiv.org/abs/2109.07958

  21. [22]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective. arXiv preprint arXiv:2503.20783, 2025

  22. [23]

    A Comparative Benchmark of Large Language Models for Labelling Wind Turbine Maintenance Logs

    Max Malyi, Jonathan Shek, Alasdair McDonald, and Andre Biscaya. A comparative benchmark of large language models for labelling wind turbine maintenance logs, 2025. URL https://arxiv.org/abs/2509.06813

  23. [24]

    WFCRL: A multi-agent reinforcement learning benchmark for wind farm control

    Claire Bizon Monroc, Ana Busic, Donatien Dubuc, and Jiamin Zhu. WFCRL: A multi-agent reinforcement learning benchmark for wind farm control. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024. URL https://openreview.net/forum?id=ZRMAhpZ3ED

  24. [25]

    LC-opt: Benchmarking reinforcement learning and agentic AI for end-to-end liquid cooling optimization in data centers

    Avisek Naug, Antonio Guillen-Perez, Vineet Kumar, Scott Greenwood, Wesley Brewer, Sahand Ghorbanpour, Ashwin Ramesh Babu, Vineet Gundecha, Ricardo Luna Gutierrez, and Soumyendu Sarkar. LC-opt: Benchmarking reinforcement learning and agentic AI for end-to-end liquid cooling optimization in data centers. In The Thirty-ninth Annual Conference on Neural Inf...

  25. [26]

    Add failure diagnostics information to asset incidents and anomalies, 2025

    Oracle. Add failure diagnostics information to asset incidents and anomalies, 2025. Oracle IoT Asset Monitoring Cloud Service

  26. [27]

    Prescriptive maintenance explained: Beyond predictive cmms, Apr 2026

    OxMaint. Prescriptive maintenance explained: Beyond predictive cmms, Apr 2026. URL https://oxmaint.com/article/prescriptive-maintenance-cmms-guide. Accessed: 2026-05-02

  27. [28]

    Assetopsbench: Benchmarking ai agents for task automation in industrial asset operations and maintenance, 2025

    Dhaval Patel, Shuxin Lin, James Rayfield, Nianjun Zhou, Chathurangi Shyalika, Suryanarayana R Yarrabothula, Roman Vaculin, Natalia Martinez, Fearghal O’donncha, and Jayant Kalagnanam. Assetopsbench: Benchmarking ai agents for task automation in industrial asset operations and maintenance. arXiv preprint arXiv:2506.03828, 2025

  28. [29]

    Humanity's Last Exam

    Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity’s last exam. arXiv preprint arXiv:2501.14249, 2025

  29. [30]

    ERVQA: A dataset to benchmark the readiness of large vision language models in hospital environments

    Sourjyadip Ray, Kushal Gupta, Soumi Kundu, Dr Payal Arvind Kasat, Somak Aditya, and Pawan Goyal. ERVQA: A dataset to benchmark the readiness of large vision language models in hospital environments. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1...

  30. [31]

    M. Raza, Z. Jahangir, M. B. Riaz, et al. Industrial applications of large language models. Scientific Reports, 15:13755, 2025. doi: 10.1038/s41598-025-98483-1. URL https://doi.org/10.1038/s41598-025-98483-1

  31. [32]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. ArXiv, abs/1908.10084, 2019. URL https://api.semanticscholar.org/CorpusID:201646309

  32. [33]

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level google-proof q&a benchmark. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=Ti67584b98

  33. [34]

    Leveraging large language models for multiple choice question answering, 2023

    Joshua Robinson, Christopher Michael Rytting, and David Wingate. Leveraging large language models for multiple choice question answering, 2023. URL https://arxiv.org/abs/2210.12353

  34. [35]

    ClimRetrieve: A benchmarking dataset for information retrieval from corporate climate disclosures

    Tobias Schimanski, Jingwei Ni, Roberto Spacey Martín, Nicola Ranger, and Markus Leippold. ClimRetrieve: A benchmarking dataset for information retrieval from corporate climate disclosures. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 17509–175...

  35. [36]

    Anomaly detection in iiot: A case study using machine learning

    Gauri Shah and Aashis Tiwari. Anomaly detection in iiot: A case study using machine learning. In Proceedings of the ACM India joint international conference on data science and management of data, pages 295–300, 2018

  36. [37]

    SkySpark Analytics Platform

    SkyFoundry, LLC. SkySpark Analytics Platform. https://skyfoundry.com/product, 2026. Accessed: 2026-05-06

  37. [38]

    Large language models for forecasting and anomaly detection: A systematic literature review, 2024

    Jing Su, Chufeng Jiang, Xin Jin, Yuxin Qiao, Tingsong Xiao, Hongda Ma, Rong Wei, Zhi Jing, Jiajun Xu, and Junhong Lin. Large language models for forecasting and anomaly detection: A systematic literature review, 2024. URL https://arxiv.org/abs/2402.10350

  38. [39]

    CasiMedicos-arg: A medical question answering dataset annotated with explanatory argumentative structures

    Ekaterina Sviridova, Anar Yeginbergen, Ainara Estarrona, Elena Cabrio, Serena Villata, and Rodrigo Agerri. CasiMedicos-arg: A medical question answering dataset annotated with explanatory argumentative structures. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Pr...

  39. [40]

    Itformer: Bridging time series and natural language for multi-modal qa with large-scale multitask dataset

    Yilin Wang, Peixuan Lei, Jie Song, Yuzhe Hao, Tao Chen, Yuxuan Zhang, Lei Jia, Yuanxiang Li, and Zhongyu Wei. Itformer: Bridging time series and natural language for multi-modal qa with large-scale multitask dataset. In 42nd International Conference on Machine Learning, volume abs/2506.20093, 2025. URL https://api.semanticscholar.org/CorpusID:280000242

  40. [41]

    MMLU-pro: A more robust and challenging multi-task language understanding benchmark

    Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. MMLU-pro: A more robust and challenging multi-task language understanding benchmark. In The Thirty-eighth Conference on Neural Information Processin...

  41. [42]

    Benchmarking complex instruction-following with multiple constraints composition

    Bosi Wen, Pei Ke, Xiaotao Gu, Lindong Wu, Hao Huang, Jinfeng Zhou, Wenchuang Li, Binxin Hu, Wendy Gao, Jiaxing Xu, Yiming Liu, Jie Tang, Hongning Wang, and Minlie Huang. Benchmarking complex instruction-following with multiple constraints composition. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 202...

  42. [43]

    Smarter Planet

    Wikipedia contributors. Smarter Planet. https://en.wikipedia.org/wiki/Smarter_Planet, 2026. Accessed: 2026-05-06

  43. [44]

    Phm-bench: A domain-specific benchmarking framework for systematic evaluation of large models in prognostics and health management

    Puyu Yang, Laifa Tao, Zijian Huang, Haifei Liu, Wenyan Cao, Hao Ji, Jianan Qiu, Qixuan Huang, Xuanyuan Su, Yuhang Xie, Jun Zhang, Shangyu Li, Chen Lu, and Zhixuan Lian. Phm-bench: A domain-specific benchmarking framework for systematic evaluation of large models in prognostics and health management, 2025. URL https://arxiv.org/abs/2508.02490

  44. [45]

    Benchmarking llms via uncertainty quantification

    Fanghua Ye, Mingming Yang, Jianhui Pang, Longyue Wang, Derek Wong, Emine Yilmaz, Shuming Shi, and Zhaopeng Tu. Benchmarking llms via uncertainty quantification. Advances in Neural Information Processing Systems, 37:15356–15385, 2024

  45. [46]

    Camb: A comprehensive industrial llm benchmark on civil aviation maintenance, 2025

    Feng Zhang, Chengjie Pang, Yuehan Zhang, and Chenyu Luo. Camb: A comprehensive industrial llm benchmark on civil aviation maintenance, 2025. URL https://arxiv.org/abs/2508.20420

  46. [47]

    RAG4ITOps: A supervised fine-tunable and comprehensive RAG framework for IT operations and maintenance

    Tianyang Zhang, Zhuoxuan Jiang, Shengguang Bai, Tianrui Zhang, Lin Lin, Yang Liu, and Jiawei Ren. RAG4ITOps: A supervised fine-tunable and comprehensive RAG framework for IT operations and maintenance. In Franck Dernoncourt, Daniel Preoţiuc-Pietro, and Anastasia Shimorina, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Langua...

  47. [48]

    RuAG: Learned-rule-augmented generation for large language models

    Yudi Zhang, Pei Xiao, Lu Wang, Chaoyun Zhang, Meng Fang, Yali Du, Yevgeniy Puzyrev, Randolph Yao, Si Qin, Qingwei Lin, Mykola Pechenizkiy, Dongmei Zhang, Saravan Rajmohan, and Qi Zhang. RuAG: Learned-rule-augmented generation for large language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.ne...

  48. [49]

    Multiple-choice questions are efficient and robust llm evaluators, 2024

    Ziyin Zhang, Zhaokun Jiang, Lizhen Xu, Hongkun Hao, and Rui Wang. Multiple-choice questions are efficient and robust llm evaluators, 2024. URL https://arxiv.org/abs/2405.11966

  49. [50]

    Large language models are not robust multiple choice selectors

    Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors, 2024. URL https://arxiv.org/abs/2309.03882

  50. [51]

    Natural language processing approaches in industrial maintenance: A systematic literature review

    Keyi Zhong, Tom Jackson, Andrew West, and Georgina Cosma. Natural language processing approaches in industrial maintenance: A systematic literature review. Procedia Computer Science, 232:2082–2097, 2024. ISSN 1877-0509. doi: https://doi.org/10.1016/j.procs.2024.02.029. URL https://www.sciencedirect.com/science/article/pii/S1877050924002060. 5th Internati...
