Recognition: 2 theorem links · Lean Theorem
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
Pith reviewed 2026-05-12 00:56 UTC · model grok-4.3
The pith
Frontier LLMs translate industrial symbolic rules to actions well until rules are structurally perturbed, exposing calibration as the deployment limit.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The frontier has closed among top LLMs on template-style rule-to-action tasks, yet every model loses substantial accuracy under structural perturbation and frequently selects the original answer even after condition inversion, revealing pattern matching rather than robust reasoning.
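The condition-inversion probe is easiest to see in miniature. Below is a minimal sketch of what inverting comparison operators in a symbolic rule could look like; the abstract describes the Aug variant only as "inverting all temporal comparison operators", so the rule encoding and helper below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of Aug-style condition inversion.
# The nested-tuple rule encoding and the operator table are assumptions;
# the paper only states that temporal comparison operators are inverted.

INVERT = {">": "<=", ">=": "<", "<": ">=", "<=": ">", "==": "!=", "!=": "=="}

def invert_conditions(node):
    """Recursively invert every comparison operator in a condition tree.

    A leaf is (sensor, op, threshold); an internal node is
    ("and" | "or", [children]).
    """
    if node[0] in ("and", "or"):
        return (node[0], [invert_conditions(c) for c in node[1]])
    sensor, op, threshold = node
    return (sensor, INVERT[op], threshold)

rule = ("and", [("chiller_supply_temp", ">", 7.5),
                ("compressor_runtime_min", ">=", 30)])
print(invert_conditions(rule))
# ('and', [('chiller_supply_temp', '<=', 7.5),
#          ('compressor_runtime_min', '<', 30)])
```

A model that still selects the original corrective action after such a flip is matching surface form rather than reasoning over the condition.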
What carries the argument
The symbolic-to-MCQA pipeline that normalizes rules to Disjunctive Normal Form and samples distractors via embeddings to create five probing variants of each question.
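Source fragments describe four pipeline stages: condition extraction from the rule's condition tree, retrieval of the matching asset description, option generation via selection (observations similar to the rule's action) and elimination (irrelevant observations as distractors), and question assembly combining each condition with each (question, options, answer) tuple. Below is a minimal sketch of the embedding-based distractor step, assuming a sentence-transformers encoder; the model name, candidate pool, and helper are illustrative, not the paper's code.

```python
# Hypothetical sketch of embedding-based distractor sampling.
# Model choice and candidate pool are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def sample_distractors(correct_action, candidate_actions, k=3):
    """Pick the k candidate actions most similar to the correct one.

    Near-miss distractors make the MCQA item hard: they are plausible
    by vector proximity, though not guaranteed domain-plausible
    (the referee's concern below).
    """
    texts = [correct_action] + candidate_actions
    emb = model.encode(texts, normalize_embeddings=True)
    sims = emb[1:] @ emb[0]  # cosine similarity to the correct action
    top = np.argsort(-sims)[:k]
    return [candidate_actions[i] for i in top]

options = sample_distractors(
    "Replace the condenser fan motor and verify airflow",
    ["Clean the condenser coils and recheck head pressure",
     "Replace the evaporator fan motor and verify airflow",
     "Reset the building management system schedule",
     "Order replacement HVAC filters for next quarter"],
)
```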
If this is right
- Top models perform within one Macro point of each other on the base benchmark.
- All models show 13–60% relative accuracy loss under distractor expansion (see the formula after this list).
- Frontier models still pick the original answer 49–63% of the time after condition inversion.
- The bottleneck for deployment is calibration for robustness rather than raw capability.
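For concreteness, a relative accuracy loss of the kind quoted above is conventionally computed as follows; this is our reading of the figure, since the paper's exact definition is not quoted here:

```latex
\Delta_{\mathrm{rel}} \;=\;
\frac{\mathrm{Acc}_{\mathrm{base}} - \mathrm{Acc}_{\mathrm{variant}}}{\mathrm{Acc}_{\mathrm{base}}}
\qquad\text{e.g.}\quad \frac{0.80 - 0.48}{0.80} = 40\%.
```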
Where Pith is reading between the lines
- Training or prompting methods that explicitly include structural perturbations could close the observed gap.
- The same pipeline might reveal similar calibration issues in other domains that rely on symbolic rules, such as medical protocols or regulatory compliance.
- Hybrid systems that pair LLMs with explicit symbolic checkers could reduce reliance on pattern matching.
Load-bearing premise
The generated multiple-choice questions faithfully test the specialist knowledge needed for real maintenance decisions without artifacts from the normalization or distractor choices.
What would settle it
A model that maintains its base accuracy across the Pert and Aug variants with no relative drop would show the claimed brittleness does not hold.
original abstract
Monitoring complex industrial assets relies on engineer-authored symbolic rules that trigger based on sensor conditions and prompt technicians to perform corrective actions. The bottleneck is not detection but response: translating rules into maintenance steps requires asset-specific knowledge gained through years of practice. We investigate whether LLMs can serve as decision support for this rule-to-action step and introduce DiagnosticIQ, a benchmark of 6,690 expert-validated multiple-choice questions from 118 rule-action pairs across 16 asset types. We contribute (i) a symbolic-to-MCQA pipeline normalizing rules to Disjunctive Normal Form with embedding-based distractor sampling, (ii) five variants probing distinct failure modes (Pro, Pert, Verbose, Aug, Rationale), and (iii) a benchmark of 29 LLMs and 4 embedding baselines. A human evaluation (9 practitioners, mean 45.0%) confirms DiagnosticIQ requires specialist knowledge beyond operational experience. Three findings stand out. The frontier has closed: the top three LLMs lie within one Macro point, with Bradley-Terry Elo placing claude-opus-4-6 30 points above the next model. Yet DiagnosticIQ Pro exposes brittleness, with every model losing 13–60% relative accuracy under distractor expansion. DiagnosticIQ Aug exposes pattern-matching: under condition inversion, frontier models still select the original answer 49–63% of the time. The deployment bottleneck is not capability but calibration: frontier models handle template-style fault detection but break under structural perturbation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DiagnosticIQ, a benchmark of 6,690 expert-validated MCQs derived from 118 rule-action pairs across 16 industrial asset types. It describes a symbolic-to-MCQA pipeline that normalizes rules to DNF and uses embedding-based distractor sampling, evaluates 29 LLMs across five variants (Pro, Pert, Verbose, Aug, Rationale), and reports that frontier models achieve high accuracy on template-style questions but suffer 13-60% relative accuracy drops and 49-63% inversion persistence under structural perturbations. Human evaluation with 9 practitioners yields 45% mean accuracy, supporting the claim that the deployment bottleneck is calibration rather than capability.
Significance. If the benchmark construction faithfully isolates specialist maintenance knowledge without pipeline artifacts, the work provides a useful domain-specific evaluation framework and dataset that highlights robustness limitations in current LLMs for industrial decision support. The quantitative findings on brittleness and pattern-matching, together with the human baseline, could inform targeted improvements in LLM calibration for safety-critical applications.
major comments (2)
- [§3 (Symbolic-to-MCQA Pipeline)] The symbolic-to-MCQA pipeline (DNF normalization plus embedding-based distractor sampling) is load-bearing for the central claim that observed model drops reflect genuine calibration deficits rather than artifacts. DNF normalization can change rule scope or introduce logical equivalences absent from the original engineer-authored rules, and embedding similarity may select distractors based on vector proximity rather than domain plausibility. No concrete examples of original vs. normalized rules or analysis of semantic drift are provided to rule this out. A sketch of one such equivalence check appears after these comments.
- [Human Evaluation] The human evaluation (9 practitioners, mean 45.0% accuracy) is used to validate that the questions require specialist knowledge. However, it lacks reported inter-rater reliability, statistical tests for significance, controls for general reasoning load, or comparison against operational-experience baselines, leaving open the possibility that model performance gaps partly reflect pipeline artifacts instead of the intended brittleness.
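One cheap artifact check the first point invites is verifying propositional equivalence of each rule before and after normalization. A sketch with SymPy follows; the symbols and example rule are illustrative. Note that equivalence at the propositional level is guaranteed by construction, so this guards only against pipeline bugs; it does not catch drift in how atomic conditions are segmented or rendered into question text, which is the deeper worry.

```python
# Sketch of a propositional-equivalence check for DNF normalization.
# Symbols and the example rule are illustrative assumptions.
from sympy import symbols
from sympy.logic.boolalg import to_dnf, Equivalent, Not
from sympy.logic.inference import satisfiable

temp_high, flow_low, vib_high = symbols("temp_high flow_low vib_high")

rule = temp_high & (flow_low | vib_high)   # engineer-authored form
rule_dnf = to_dnf(rule, simplify=True)
# (temp_high & flow_low) | (temp_high & vib_high)

# Equivalent iff the negated biconditional is unsatisfiable.
assert satisfiable(Not(Equivalent(rule, rule_dnf))) is False
```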
minor comments (2)
- [Results] The abstract and results sections report relative accuracy drops (13-60%) and inversion persistence (49-63%) but do not specify the exact baseline accuracy values or the statistical method used to compute these figures.
- [§4 (Benchmark Results)] The Bradley-Terry Elo ranking places claude-opus-4-6 30 points above the next model, but the paper does not report the full ranking table or the number of pairwise comparisons underlying the model. A sketch of how such a fit is computed appears after these comments.
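For reference, Bradley-Terry strengths can be fit from a table of pairwise win counts in a few lines. The minorization-maximization update below is the standard algorithm; the conversion to an Elo-like scale is one common convention and may not match the paper's. The win matrix is invented for illustration.

```python
# Sketch: Bradley-Terry strengths from pairwise win counts, mapped to
# an Elo-like scale. Win-matrix values are made up for illustration.
import numpy as np

def bradley_terry(wins, iters=200):
    """wins[i, j] = number of questions where model i beat model j."""
    n = wins.shape[0]
    p = np.ones(n)
    games = wins + wins.T
    for _ in range(iters):
        for i in range(n):
            denom = sum(games[i, j] / (p[i] + p[j])
                        for j in range(n) if j != i)
            p[i] = wins[i].sum() / denom
        p /= p.sum()  # fix the scale; BT is identified only up to a constant
    return p

wins = np.array([[0, 60, 70],
                 [40, 0, 55],
                 [30, 45, 0]])
p = bradley_terry(wins)
elo = 400 * np.log10(p / p.mean()) + 1500  # one common Elo-style mapping
print(np.round(elo))
```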
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive suggestions. We address each major comment below and outline the revisions we will make to strengthen the manuscript.
point-by-point responses
- Referee: [§3 (Symbolic-to-MCQA Pipeline)] The symbolic-to-MCQA pipeline (DNF normalization plus embedding-based distractor sampling) is load-bearing for the central claim that observed model drops reflect genuine calibration deficits rather than artifacts. DNF normalization can change rule scope or introduce logical equivalences absent from the original engineer-authored rules, and embedding similarity may select distractors based on vector proximity rather than domain plausibility. No concrete examples of original vs. normalized rules or analysis of semantic drift are provided to rule this out.
Authors: We agree that providing concrete examples and analysis of the pipeline would help address concerns about potential artifacts. In the revised manuscript, we will add an appendix containing side-by-side comparisons of original rules and their DNF-normalized versions for several asset types, along with a discussion of any introduced equivalences or scope changes. We will also include a manual validation by a domain expert on a subset of distractors to assess domain plausibility in addition to embedding similarity. revision: yes
- Referee: [Human Evaluation] The human evaluation (9 practitioners, mean 45.0% accuracy) is used to validate that the questions require specialist knowledge. However, it lacks reported inter-rater reliability, statistical tests for significance, controls for general reasoning load, or comparison against operational-experience baselines, leaving open the possibility that model performance gaps partly reflect pipeline artifacts instead of the intended brittleness.
Authors: We acknowledge these reporting gaps in the human evaluation. We will include inter-rater reliability measures (e.g., Fleiss' kappa; a computation sketch appears after these responses) and statistical tests for the reported accuracies in the revision. The original study design prioritized confirming the requirement for specialist knowledge through domain-experienced practitioners rather than including general reasoning controls or operational baselines; we will add a dedicated limitations paragraph discussing this choice and its implications for interpreting the results. The 45% mean accuracy nonetheless indicates the questions are challenging even for experts. revision: partial
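Since the rebuttal names Fleiss' kappa, here is a self-contained sketch of the computation on an items-by-categories count matrix; the rating counts are invented for illustration and do not come from the paper's study.

```python
# Sketch: Fleiss' kappa for the proposed inter-rater reliability report.
# Counts are invented: rows = MCQA items, columns = answer options,
# entries = how many of the 9 practitioners chose that option.
import numpy as np

def fleiss_kappa(counts):
    """counts: (n_items, n_categories), each row summing to n_raters."""
    n = counts.sum(axis=1)[0]                  # raters per item
    p_cat = counts.sum(axis=0) / counts.sum()  # category proportions
    P_i = ((counts ** 2).sum(axis=1) - n) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), (p_cat ** 2).sum()
    return (P_bar - P_e) / (1 - P_e)

ratings = np.array([[7, 1, 1, 0],
                    [2, 5, 1, 1],
                    [3, 3, 2, 1]])
print(round(fleiss_kappa(ratings), 3))
```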
Circularity Check
No circularity: purely empirical benchmark construction and evaluation
full rationale
The paper introduces DiagnosticIQ as an empirical benchmark derived from 118 engineer-authored rule-action pairs across 16 asset types. It describes a symbolic-to-MCQA pipeline (DNF normalization plus embedding-based distractor sampling) and five variants (Pro, Pert, Verbose, Aug, Rationale), then reports performance of 29 LLMs plus human validation (9 practitioners, mean 45%). No equations, fitted parameters, or predictions are defined in terms of the paper's own outputs. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claims (frontier models close in score but brittle under perturbation; bottleneck is calibration) rest on direct experimental measurements rather than any reduction to inputs by construction. This is standard benchmark methodology with no self-referential derivation chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Symbolic rules can be normalized to Disjunctive Normal Form without loss of meaning for generating valid multiple-choice action-recommendation questions.
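The propositional half of this axiom is a theorem, while the operational half is a genuine assumption. A formalization sketch (ours, not the paper's):

```latex
% At the propositional level, DNF conversion preserves semantics:
\forall \varphi .\;\; \mathrm{DNF}(\varphi) \equiv \varphi
% The axiom's real content is operational: the encoding of an
% engineer-authored rule into \varphi, and the rendering of
% \mathrm{DNF}(\varphi) back into question text, must both preserve
% the rule's maintenance meaning. That step is not a theorem.
```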
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean (LogicNat recovery and embed_injective), tagged unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Passage: "symbolic-to-MCQA pipeline normalizing rules to Disjunctive Normal Form with embedding-based distractor sampling... DiagnosticIQ Aug... inverting all temporal comparison operators"
- IndisputableMonolith/Cost/FunctionalEquation.lean (washburn_uniqueness_aczel and Jcost_pos_of_ne_one), tagged unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Passage: "frontier models handle template-style fault detection but break under structural perturbation"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.