CPSLint: A Domain-Specific Language Providing Data Validation and Sanitisation for Industrial Cyber-Physical Systems
Pith reviewed 2026-05-18 05:09 UTC · model grok-4.3
The pith
CPSLint is a domain-specific language that lets non-programmers detect and correct common corruption patterns in industrial sensor time-series data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CPSLint is a domain-specific language for data validation and sanitisation in industrial cyber-physical systems. It automatically detects and corrects common data corruption patterns in semi-structured time-series data and enables non-programming domain experts to define and apply validation rules for preparing datasets for machine learning applications such as fault detection and identification.
What carries the argument
CPSLint, a domain-specific language whose rules encode detection and correction of common corruption patterns in semi-structured industrial time-series sensor data.
If this is right
- Data preparation for machine learning becomes feasible without hiring programmers or writing custom scripts for each project.
- Consistency of cleaned datasets improves because the same rules apply uniformly across different time-series sources.
- Downstream fault detection models gain reliability from input data that has been systematically sanitised.
- The same language can be reused on multiple CPS datasets without redesigning the validation logic each time.
Where Pith is reading between the lines
- If the patterns prove general, CPSLint-style rules could be shared as libraries across different industrial sectors that collect sensor data.
- The DSL approach might extend to other semi-structured domains such as environmental monitoring or medical device logs where similar corruption issues arise.
- Adding automated suggestions for rule creation could further reduce the expertise needed from domain users.
Load-bearing premise
The corruption patterns seen in the tested industrial CPS datasets are general enough that rules written by non-programmers in the DSL will work on new projects and systems.
What would settle it
A test in which non-programming domain experts use CPSLint on a fresh industrial dataset and the tool fails to catch or fix most of the actual corruptions present would disprove the central claim.
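One way to picture the correctness half of such a test is a ground-truth comparison: corrupt a clean series at known positions, run a detector, and score what it flags. The sketch below is illustrative only. `inject_missing`, `detect_missing` and `precision_recall` are hypothetical helpers written for this example, not part of CPSLint, and the NaN detector stands in for whatever rule a domain expert would write.

```python
import math
import random


def inject_missing(values, fraction, seed=0):
    """Return a corrupted copy of `values` and the set of corrupted indices."""
    rng = random.Random(seed)
    count = int(len(values) * fraction)
    corrupted_idx = set(rng.sample(range(len(values)), count))
    corrupted = [math.nan if i in corrupted_idx else v for i, v in enumerate(values)]
    return corrupted, corrupted_idx


def detect_missing(values):
    """Hypothetical detector (an assumption, not CPSLint): flag NaN readings."""
    return {i for i, v in enumerate(values) if isinstance(v, float) and math.isnan(v)}


def precision_recall(detected, truth):
    """Score detected indices against the known (ground-truth) corrupted indices."""
    true_pos = len(detected & truth)
    precision = true_pos / len(detected) if detected else 1.0
    recall = true_pos / len(truth) if truth else 1.0
    return precision, recall


# Synthetic clean series; a real test would use a fresh industrial dataset.
clean = [20.0 + 0.1 * i for i in range(100)]
corrupted, truth = inject_missing(clean, fraction=0.1)
precision, recall = precision_recall(detect_missing(corrupted), truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low recall on corruptions that actually occur in a new dataset would be the kind of evidence that counts against the central claim. The usability half of the test still needs human participants and cannot be scripted this way.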
Original abstract
Industrial cyber-physical systems generate vast amounts of semi-structured time-series data that require careful preprocessing before they can be effectively used for machine learning applications such as fault detection and identification. Raw sensor datasets are often corrupted or incomplete, making it challenging to develop reliable solutions without proper data preparation and validation. In this paper, we introduce CPSLint, a domain-specific language for data validation and sanitisation. We present the design, implementation and evaluation of CPSLint, demonstrating its ability to automatically detect and correct common data corruption patterns while enabling non-programming domain experts to effectively prepare their data for analysis. We report evaluation results on a representative dataset, tracking memory consumption and CPU-time for sanitisation activities. Our approach offers several advantages over traditional methods, including reduced manual effort, guaranteed consistency and broader applicability across time-series datasets and projects.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CPSLint, a domain-specific language for data validation and sanitisation in industrial cyber-physical systems. It presents the design, implementation, and evaluation of the DSL, claiming that it automatically detects and corrects common data corruption patterns in semi-structured time-series data from CPS, enabling non-programming domain experts to prepare data for machine learning tasks such as fault detection. Evaluation results are reported on a representative dataset, focusing on memory consumption and CPU-time for sanitisation activities, with advantages claimed over traditional methods including reduced manual effort and guaranteed consistency.
Significance. If the central claims hold, CPSLint would address a practical challenge in CPS data preprocessing by providing a tailored DSL that reduces reliance on programmers and ensures consistent handling of corruption patterns across projects. This could facilitate broader adoption of ML-based fault detection in industrial settings. The work's emphasis on performance metrics for a representative dataset provides some grounding in resource efficiency, but the absence of demonstrated usability or correctness leaves the significance conditional on further validation.
major comments (2)
- [Evaluation] Evaluation section: The reported results track only memory consumption and CPU-time on a representative dataset. This provides no evidence on sanitisation correctness (e.g., via ground-truth comparison of outputs) or usability by non-programming domain experts (e.g., via user studies or rule-authoring assessments), which are required to substantiate the central claim that CPSLint enables effective data preparation by domain experts.
- [Abstract and Design] Abstract and Design sections: The claim that CPSLint 'automatically detects and corrects common data corruption patterns' and offers 'guaranteed consistency' lacks any description of rule definition mechanisms, formal correctness guarantees, or how the DSL syntax supports non-experts, leaving the generality of observed patterns and the enabling claim as untested assumptions rather than demonstrated results.
minor comments (1)
- [Abstract] The abstract refers to 'broader applicability across time-series datasets and projects' without specifying the scope of the representative dataset or any cross-dataset validation experiments.
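For context on the first major comment, the kind of measurement the paper reports (CPU-time and memory for a sanitisation step) can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: `sanitise` is a hypothetical stand-in that clamps readings to a range, `profile` is a helper invented here, and neither reflects how the authors instrumented CPSLint.

```python
import time
import tracemalloc


def sanitise(values, low, high):
    """Hypothetical stand-in for a sanitisation step: clamp readings to [low, high]."""
    return [min(max(v, low), high) for v in values]


def profile(func, *args):
    """Run func(*args) and return (result, cpu_seconds, peak_bytes)."""
    tracemalloc.start()
    cpu_start = time.process_time()  # CPU time of this process, not wall-clock time
    result = func(*args)
    cpu_seconds = time.process_time() - cpu_start
    _, peak_bytes = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return result, cpu_seconds, peak_bytes


# Synthetic readings; values above 100.0 play the role of out-of-range samples.
readings = [float(i % 150) for i in range(100_000)]
cleaned, cpu_seconds, peak_bytes = profile(sanitise, readings, 0.0, 100.0)
print(f"cpu={cpu_seconds:.4f}s peak={peak_bytes / 1024:.1f} KiB")
```

Note that `tracemalloc` only tracks memory allocated through Python, and that both numbers describe cost alone. They say nothing about whether the cleaned output is right, which is the gap the referee points to.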
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below, clarifying the manuscript's content and indicating planned revisions where appropriate to better substantiate our claims.
Point-by-point responses
- Referee: [Evaluation] Evaluation section: The reported results track only memory consumption and CPU-time on a representative dataset. This provides no evidence on sanitisation correctness (e.g., via ground-truth comparison of outputs) or usability by non-programming domain experts (e.g., via user studies or rule-authoring assessments), which are required to substantiate the central claim that CPSLint enables effective data preparation by domain experts.
  Authors: We agree that the evaluation section focuses on performance metrics (memory consumption and CPU-time) for a representative dataset to demonstrate practical efficiency in industrial settings. The manuscript's design section explains how the DSL enables non-experts through its syntax for specifying common corruption patterns, with automatic correction applied deterministically. However, we acknowledge the absence of explicit ground-truth comparisons or user studies. In revision, we will expand the evaluation to include a qualitative discussion of correctness via the rule application process and add a note on intended usability for domain experts, while marking full empirical validation as future work.
  Revision: partial
- Referee: [Abstract and Design] Abstract and Design sections: The claim that CPSLint 'automatically detects and corrects common data corruption patterns' and offers 'guaranteed consistency' lacks any description of rule definition mechanisms, formal correctness guarantees, or how the DSL syntax supports non-experts, leaving the generality of observed patterns and the enabling claim as untested assumptions rather than demonstrated results.
  Authors: The design section presents the DSL syntax and rule definition mechanisms, allowing domain experts to specify patterns for issues such as missing values or outliers using high-level constructs without requiring programming expertise. Automatic detection and correction occur through deterministic rule application, which underpins the claim of guaranteed consistency. We do not provide formal proofs, as the work emphasises practical implementation over theoretical verification. We will revise the abstract and design sections to include more explicit examples of rule syntax and clarify how it supports non-experts.
  Revision: yes
- The manuscript does not contain user studies or formal correctness proofs, which cannot be added without new empirical or theoretical work.
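The rebuttal's appeal to deterministic rule application can be made concrete with a small sketch. Everything here is an assumption made for illustration: the `RULES` structure, the `forward_fill` and `clamp` fixes, and the column name are hypothetical and do not reproduce CPSLint syntax or semantics. The point is only that rules expressed as data and applied in a fixed order give the same output for the same input.

```python
import math

# Hypothetical rule set expressed as plain data; this is not CPSLint syntax.
RULES = [
    {"column": "temperature", "check": "missing", "fix": "forward_fill"},
    {"column": "temperature", "check": "range", "low": -40.0, "high": 125.0, "fix": "clamp"},
]


def is_missing(value):
    """Treat None and NaN as missing readings."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def apply_rule(rows, rule):
    """Return a new list of rows with one rule applied in row order."""
    column = rule["column"]
    out = []
    last_valid = None
    for row in rows:
        value = row[column]
        if rule["check"] == "missing":
            if is_missing(value):
                value = last_valid  # stays None if no earlier valid reading exists
            else:
                last_valid = value
        elif rule["check"] == "range" and not is_missing(value):
            value = min(max(value, rule["low"]), rule["high"])
        out.append({**row, column: value})
    return out


def sanitise(rows, rules):
    """Apply every rule in list order; the input rows are left unmodified."""
    for rule in rules:
        rows = apply_rule(rows, rule)
    return rows


raw = [
    {"t": 0, "temperature": 21.5},
    {"t": 1, "temperature": None},   # missing reading
    {"t": 2, "temperature": 900.0},  # out-of-range reading
]
first = sanitise(raw, RULES)
second = sanitise(raw, RULES)
assert first == second  # same input and same rules give the same output
print(first)
```

Determinism of this kind supports repeatability, but it does not show that a rule's fix is the right one for a given corruption. That is why the referee's request for ground-truth comparison remains open.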
Circularity Check
No circularity: tool introduction with independent implementation and metrics
full rationale
The paper presents the design and implementation of a new DSL (CPSLint) for validating and sanitising CPS time-series data, followed by performance evaluation on memory and CPU metrics for a representative dataset. No mathematical derivations, fitted parameters renamed as predictions, self-referential definitions, or load-bearing self-citations appear in the provided abstract or described structure. The central claims rest on the explicit construction of the DSL and reported runtime measurements, which are independent of the claims themselves and do not reduce by construction to prior inputs or author citations. This is a standard engineering/tool paper with no evidence of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Common data corruption patterns exist in industrial CPS time-series data and can be automatically detected and corrected by a DSL.
invented entities (1)
- CPSLint DSL: no independent evidence
Reference graph
Works this paper leans on
- [1] Joachim Arts, Robert N. Boute, Stijn Loeys, and Heletjé E. van Staden. 2025. Fifty Years of Maintenance Optimization: Reflections and Perspectives. European Journal of Operational Research. doi: 10.1016/j.ejor.2024.07.002
- [2] Seokgoo Kim, Joo-Ho Choi, and Nam H. Kim. 2021. Challenges and Opportunities of System-Level Prognostics. Sensors. doi: 10.3390/s21227655
- [3] Ferhat Tamssaouet, Khanh Tp Nguyen, Kamal Medjaher, and Marcos Eduardo Orchard. 2023. System-Level Failure Prognostics: Literature Review and Main Challenges. Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability. doi: 10.1177/1748006X221118448
- [4] [SW] Uraz Odyurt, Ömer Sayilir, Mariëlle Stoelinga, and Vadim Zaytsev, CPSLint version v0.5.0, Oct. 2025. doi: 10.5281/zenodo.17406796
- [5] Marjan Mernik, Jan Heering, and Anthony M. Sloane. 2005. When and How to Develop Domain-Specific Languages. ACM Computing Surveys. doi: 10.1145/1118890.1118892
- [6] Uraz Odyurt, Andy D. Pimentel, and Ignacio Gonzalez Alonso. 2022. Improving the robustness of industrial Cyber–Physical Systems through machine learning-based performance anomaly identification. Journal of Systems Architecture. doi: 10.1016/j.sysarc.2022.102716
- [7] Uraz Odyurt, Julius Roeder, Andy D. Pimentel, Ignacio Gonzalez Alonso, and Cees de Laat. 2021. Power Passports for Fault Tolerance: Anomaly Detection in Industrial CPS Using Electrical EFB. In 2021 4th IEEE International Conference on Industrial Cyber-Physical Systems (ICPS). doi: 10.1109/ICPS49255.2021.9468262
- [8] Vadim Zaytsev. 2017. Language Design with Intent. In 2017 ACM/IEEE 20th International Conference on Model Driven Engineering Languages and Systems (MoDELS). doi: 10.1109/MODELS.2017.16
- [9] Paul Klint, Tijs van der Storm, and Jurgen Vinju. 2009. RASCAL: A Domain Specific Language for Source Code Analysis and Manipulation. In 2009 Ninth IEEE International Working Conference on Source Code Analysis and Manipulation. doi: 10.1109/SCAM.2009.28
- [10] Paul Klint, Tijs van der Storm, and Jurgen Vinju. 2011. EASY Meta-programming with Rascal. In Generative and Transformational Techniques in Software Engineering III: International Summer School, GTTSE 2009, Braga, Portugal, July 6-11,
- [11] Revised Papers. Springer. doi: 10.1007/978-3-642-18023-1_6
- [12] Federico Tomassetti and Vadim Zaytsev. 2020. Reflections on the Lack of Adoption of Domain Specific Languages. In STAF Workshop Proceedings (OOPSLE). http://ceur-ws.org/Vol-2707/oopslepaper5.pdf
- [13] GNU Project. 2025. GNU datamash. Retrieved Aug. 26, 2025 from https://www.gnu.org/software/datamash/
- [14] Anders Hoff. 2024. Lisp Query Notation — A DSL for Data Processing. (2024). doi: 10.5281/zenodo.11001584
- [15] Joan Giner-Miguelez, Abel Gómez, and Jordi Cabot. 2023. A Domain-Specific Language for Describing Machine Learning Datasets. Journal of Computer Languages. doi: 10.1016/j.cola.2023.101209
- [16] Felix Heine, Carsten Kleiner, and Thomas Oelsner. 2020. A DSL for Automated Data Quality Monitoring. In Database and Expert Systems Applications. doi: 10.1007/978-3-030-59003-1_6
- [17] Alfonso de la Vega, Diego García-Saiz, Marta Zorrilla, and Pablo Sánchez. 2020. Lavoisier: A DSL for Increasing the Level of Abstraction of Data Selection and Formatting in Data Mining. Journal of Computer Languages. doi: 10.1016/j.cola.2020.100987
- [18] Brian Sal, Diego García-Saiz, Alfonso de la Vega, and Pablo Sánchez. 2024. Domain-Specific Languages for the Automated Generation of Datasets for Industry 4.0 Applications. Journal of Industrial Information Integration. doi: 10.1016/j.jii.2024.100657
- [19] Stefan Ackermann, Vojin Jovanovic, Tiark Rompf, and Martin Odersky. 2012. Jet: An Embedded DSL for High Performance Big Data Processing. (2012). https://infoscience.epfl.ch/handle/20.500.14299/85985
- [20] Birgit Vogel-Heuser et al. 2025. DSL4DPiFS — A Graphical Notation to Model Data Pipeline Deployment in Forming Systems. at - Automatisierungstechnik. doi: 10.1515/auto-2024-0114.
Table 1: Comparison of supported functionality amongst existing tools/DSLs and with CPSLint in the last row. The c...