arxiv: 2510.18651 · v2 · submitted 2025-10-21 · 💻 cs.PL · cs.SE

CPSLint: A Domain-Specific Language Providing Data Validation and Sanitisation for Industrial Cyber-Physical Systems

Pith reviewed 2026-05-18 05:09 UTC · model grok-4.3

classification 💻 cs.PL cs.SE
keywords domain-specific language · data validation · data sanitisation · cyber-physical systems · time-series data · machine learning · fault detection · data preprocessing

The pith

CPSLint is a domain-specific language that lets non-programmers detect and correct common corruption patterns in industrial sensor time-series data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CPSLint as a domain-specific language for validation and sanitisation of data from industrial cyber-physical systems. Raw sensor datasets are often corrupted or incomplete, which blocks reliable machine learning for tasks like fault detection. CPSLint encodes rules that automatically identify and fix recurring corruption patterns while allowing domain experts without programming skills to prepare their data. Evaluation on a representative dataset measures memory consumption and CPU time during sanitisation. The approach claims advantages in reduced manual work, consistent results, and wider use across time-series projects.

Core claim

CPSLint is a domain-specific language for data validation and sanitisation in industrial cyber-physical systems. It automatically detects and corrects common data corruption patterns in semi-structured time-series data and enables non-programming domain experts to define and apply validation rules for preparing datasets for machine learning applications such as fault detection and identification.

What carries the argument

CPSLint, a domain-specific language whose rules encode detection and correction of common corruption patterns in semi-structured industrial time-series sensor data.
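
To make the idea of rules that detect and correct corruption patterns concrete, here is a minimal Python sketch. It is not CPSLint syntax and not the authors' implementation: the names `missing_rule`, `range_rule` and `sanitise`, the plausible range, and the "repeat the previous valid reading" correction are all assumptions chosen for illustration.

```python
import math

def missing_rule(values):
    """Flag readings that are absent (None) or NaN."""
    return [i for i, v in enumerate(values)
            if v is None or (isinstance(v, float) and math.isnan(v))]

def range_rule(low, high):
    """Build a rule that flags readings outside the plausible range [low, high]."""
    def rule(values):
        return [i for i, v in enumerate(values)
                if v is not None and not math.isnan(v) and not (low <= v <= high)]
    return rule

def sanitise(values, rules):
    """Flag with every rule, then replace flagged readings by the previous valid one."""
    flagged = set()
    for rule in rules:
        flagged.update(rule(values))
    cleaned = []
    last_valid = None
    for i, v in enumerate(values):
        if i in flagged:
            cleaned.append(last_valid)  # stays None if no valid reading was seen yet
        else:
            cleaned.append(v)
            last_valid = v
    return cleaned, sorted(flagged)

# Hypothetical temperature trace with a gap, a spike and a NaN
trace = [20.1, None, 20.3, 999.0, 20.4, float("nan"), 20.6]
cleaned, flagged = sanitise(trace, [missing_rule, range_rule(0.0, 100.0)])
print(flagged)  # [1, 3, 5]
print(cleaned)  # [20.1, 20.1, 20.3, 20.3, 20.4, 20.4, 20.6]
```

Keeping detection (which indices are suspect) separate from correction (what replaces them) is one possible design; it lets the same rules serve for validation-only runs.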

If this is right

  • Data preparation for machine learning becomes feasible without hiring programmers or writing custom scripts for each project.
  • Consistency of cleaned datasets improves because the same rules apply uniformly across different time-series sources.
  • Downstream fault detection models gain reliability from input data that has been systematically sanitised.
  • The same language can be reused on multiple CPS datasets without redesigning the validation logic each time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the patterns prove general, CPSLint-style rules could be shared as libraries across different industrial sectors that collect sensor data.
  • The DSL approach might extend to other semi-structured domains such as environmental monitoring or medical device logs where similar corruption issues arise.
  • Adding automated suggestions for rule creation could further reduce the expertise needed from domain users.

Load-bearing premise

The corruption patterns seen in the tested industrial CPS datasets are general enough that rules written by non-programmers in the DSL will work on new projects and systems.

What would settle it

A test in which non-programming domain experts use CPSLint on a fresh industrial dataset and the tool fails to catch or fix most of the actual corruptions present would disprove the central claim.

Figures

Figures reproduced from arXiv: 2510.18651 by Mariëlle Stoelinga, Ömer Sayilir, Uraz Odyurt, Vadim Zaytsev.

Figure 1: An example CPS machine cycle, involving action
Figure 2: Different data compartmentalisation granularities
Figure 3: A tombstone diagram of CPSLint: The flow of activities from left to right, with inspection and inferring of the data structure happening on the right side, followed by Python code generation to deliver the executable code operating on the actual machine trace. The role of a domain expert overseeing the inferred CPSLint specification is emphasised.

    # Generated by CPSlint
    # Input: specification.cps
    …
Original abstract

Industrial cyber-physical systems generate vast amounts of semi-structured time-series data that require careful preprocessing before they can be effectively used for machine learning applications such as fault detection and identification. Raw sensor datasets are often corrupted or incomplete, making it challenging to develop reliable solutions without proper data preparation and validation. In this paper, we introduce CPSLint, a domain-specific language for data validation and sanitisation. We present the design, implementation and evaluation of CPSLint, demonstrating its ability to automatically detect and correct common data corruption patterns while enabling non-programming domain experts to effectively prepare their data for analysis. We report evaluation results on a representative dataset, tracking memory consumption and CPU-time for sanitisation activities. Our approach offers several advantages over traditional methods, including reduced manual effort, guaranteed consistency and broader applicability across time-series datasets and projects.
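
The abstract does not say how memory consumption and CPU time were recorded. As an illustration only, the sketch below measures both for a single sanitisation step using Python's standard `tracemalloc` and `time.process_time`. The helper `measure` and the placeholder step `drop_negative` are hypothetical and are not taken from the paper.

```python
import time
import tracemalloc

def measure(func, *args):
    """Run func(*args) and return (result, cpu_seconds, peak_bytes)."""
    tracemalloc.start()
    cpu_start = time.process_time()
    try:
        result = func(*args)
        cpu_seconds = time.process_time() - cpu_start
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, cpu_seconds, peak_bytes

def drop_negative(values):
    """Placeholder sanitisation step: discard negative readings."""
    return [v for v in values if v >= 0]

# Synthetic readings in the range -5.0 .. 44.0; one in ten is negative
readings = [float(i % 50) - 5.0 for i in range(100_000)]
cleaned, cpu_seconds, peak_bytes = measure(drop_negative, readings)
print(f"kept {len(cleaned)} of {len(readings)} readings")
print(f"CPU time: {cpu_seconds:.4f} s, peak traced memory: {peak_bytes / 1024:.1f} KiB")
```

Note that `tracemalloc` only sees allocations made through Python's allocator, so the figure is a lower bound on process memory, and `process_time` excludes time spent sleeping or waiting on I/O.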

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces CPSLint, a domain-specific language for data validation and sanitisation in industrial cyber-physical systems. It presents the design, implementation, and evaluation of the DSL, claiming that it automatically detects and corrects common data corruption patterns in semi-structured time-series data from CPS, enabling non-programming domain experts to prepare data for machine learning tasks such as fault detection. Evaluation results are reported on a representative dataset, focusing on memory consumption and CPU-time for sanitisation activities, with advantages claimed over traditional methods including reduced manual effort and guaranteed consistency.

Significance. If the central claims hold, CPSLint would address a practical challenge in CPS data preprocessing by providing a tailored DSL that reduces reliance on programmers and ensures consistent handling of corruption patterns across projects. This could facilitate broader adoption of ML-based fault detection in industrial settings. The work's emphasis on performance metrics for a representative dataset provides some grounding in resource efficiency, but the absence of demonstrated usability or correctness leaves the significance conditional on further validation.

major comments (2)
  1. [Evaluation] Evaluation section: The reported results track only memory consumption and CPU-time on a representative dataset. This provides no evidence on sanitisation correctness (e.g., via ground-truth comparison of outputs) or usability by non-programming domain experts (e.g., via user studies or rule-authoring assessments), which are required to substantiate the central claim that CPSLint enables effective data preparation by domain experts.
  2. [Abstract and Design] Abstract and Design sections: The claim that CPSLint 'automatically detects and corrects common data corruption patterns' and offers 'guaranteed consistency' lacks any description of rule definition mechanisms, formal correctness guarantees, or how the DSL syntax supports non-experts, leaving the generality of observed patterns and the enabling claim as untested assumptions rather than demonstrated results.
minor comments (1)
  1. [Abstract] The abstract refers to 'broader applicability across time-series datasets and projects' without specifying the scope of the representative dataset or any cross-dataset validation experiments.
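
The ground-truth comparison the referee asks for could take roughly the following shape: corrupt a clean trace at known positions, run the sanitiser, then score detection and correction separately. This is a sketch under stated assumptions; `detection_scores`, `correction_error` and all of the data are hypothetical, and nothing here comes from the paper's evaluation.

```python
def detection_scores(flagged, truth):
    """Return (precision, recall) of flagged indices against ground-truth indices."""
    flagged, truth = set(flagged), set(truth)
    true_positives = len(flagged & truth)
    precision = true_positives / len(flagged) if flagged else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall

def correction_error(cleaned, reference, indices):
    """Mean absolute error between cleaned and reference values at the given indices."""
    if not indices:
        return 0.0
    return sum(abs(cleaned[i] - reference[i]) for i in indices) / len(indices)

reference = [1.0, 2.0, 3.0, 4.0, 5.0]   # clean trace before injected corruption
truth = [1, 3]                           # indices that were deliberately corrupted
flagged = [1, 2]                         # hypothetical sanitiser output: flagged indices
cleaned = [1.0, 1.5, 3.0, 40.0, 5.0]     # hypothetical output: index 3 was missed

precision, recall = detection_scores(flagged, truth)
print(precision, recall)                            # 0.5 0.5
print(correction_error(cleaned, reference, truth))  # 18.25
```

Scoring the error over the truly corrupted indices, rather than only the flagged ones, means a missed corruption is penalised instead of silently ignored.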

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, clarifying the manuscript's content and indicating planned revisions where appropriate to better substantiate our claims.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The reported results track only memory consumption and CPU-time on a representative dataset. This provides no evidence on sanitisation correctness (e.g., via ground-truth comparison of outputs) or usability by non-programming domain experts (e.g., via user studies or rule-authoring assessments), which are required to substantiate the central claim that CPSLint enables effective data preparation by domain experts.

    Authors: We agree that the evaluation section focuses on performance metrics (memory consumption and CPU-time) for a representative dataset to demonstrate practical efficiency in industrial settings. The manuscript's design section explains how the DSL enables non-experts through its syntax for specifying common corruption patterns, with automatic correction applied deterministically. However, we acknowledge the absence of explicit ground-truth comparisons or user studies. In revision, we will expand the evaluation to include a qualitative discussion of correctness via the rule application process and add a note on intended usability for domain experts, while marking full empirical validation as future work. revision: partial

  2. Referee: [Abstract and Design] Abstract and Design sections: The claim that CPSLint 'automatically detects and corrects common data corruption patterns' and offers 'guaranteed consistency' lacks any description of rule definition mechanisms, formal correctness guarantees, or how the DSL syntax supports non-experts, leaving the generality of observed patterns and the enabling claim as untested assumptions rather than demonstrated results.

    Authors: The design section presents the DSL syntax and rule definition mechanisms, allowing domain experts to specify patterns for issues such as missing values or outliers using high-level constructs without requiring programming expertise. Automatic detection and correction occur through deterministic rule application, which underpins the claim of guaranteed consistency. We do not provide formal proofs, as the work emphasises practical implementation over theoretical verification. We will revise the abstract and design sections to include more explicit examples of rule syntax and clarify how it supports non-experts. revision: yes
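
The link the rebuttal draws between deterministic rule application and consistency can be illustrated with a small sketch: if rules are plain data applied in a fixed order by pure functions, the same specification and input always produce the same output. The specification format below is invented for illustration and is not CPSLint syntax; `SPEC`, `apply_rule` and `apply_spec` are hypothetical names.

```python
import math

# Hypothetical declarative specification; this is NOT CPSLint syntax.
SPEC = [
    {"column": "temperature", "check": "range", "min": 0.0, "max": 120.0},
    {"column": "pressure", "check": "missing", "value": 1.0},
]

def apply_rule(rule, row):
    """Return a copy of row with one rule applied."""
    row = dict(row)
    value = row.get(rule["column"])
    is_missing = value is None or (isinstance(value, float) and math.isnan(value))
    if rule["check"] == "missing" and is_missing:
        row[rule["column"]] = rule["value"]  # fill with the configured default
    elif rule["check"] == "range" and not is_missing:
        row[rule["column"]] = min(max(value, rule["min"]), rule["max"])  # clamp
    return row

def apply_spec(spec, rows):
    """Apply rules in spec order to every row, with no randomness or hidden state."""
    cleaned = []
    for row in rows:
        for rule in spec:
            row = apply_rule(rule, row)
        cleaned.append(row)
    return cleaned

rows = [
    {"temperature": 135.2, "pressure": None},
    {"temperature": 64.0, "pressure": 1.2},
]
first = apply_spec(SPEC, rows)
second = apply_spec(SPEC, rows)
print(first[0])         # {'temperature': 120.0, 'pressure': 1.0}
print(first == second)  # True
```

Determinism of this kind gives repeatability, which is weaker than correctness: a rule that clamps to the wrong range will do so consistently on every run.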

standing simulated objections not resolved
  • The manuscript does not contain user studies or formal correctness proofs, which cannot be added without new empirical or theoretical work.

Circularity Check

0 steps flagged

No circularity: tool introduction with independent implementation and metrics

full rationale

The paper presents the design and implementation of a new DSL (CPSLint) for validating and sanitising CPS time-series data, followed by performance evaluation on memory and CPU metrics for a representative dataset. No mathematical derivations, fitted parameters renamed as predictions, self-referential definitions, or load-bearing self-citations appear in the provided abstract or described structure. The central claims rest on the explicit construction of the DSL and reported runtime measurements, which are independent of the claims themselves and do not reduce by construction to prior inputs or author citations. This is a standard engineering/tool paper with no evidence of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that common corruption patterns in CPS data are both identifiable and correctable via a DSL accessible to non-programmers; no free parameters or invented entities beyond the language itself are described.

axioms (1)
  • domain assumption: Common data corruption patterns exist in industrial CPS time-series data and can be automatically detected and corrected by a DSL.
    Invoked in the abstract when describing the language's ability to handle raw sensor datasets that are corrupted or incomplete.
invented entities (1)
  • CPSLint DSL (no independent evidence)
    purpose: To provide data validation and sanitisation rules for CPS time-series data.
    The language is the primary contribution introduced to address the stated data preparation challenges.

pith-pipeline@v0.9.0 · 5684 in / 1371 out tokens · 46338 ms · 2026-05-18T05:09:54.279480+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

  1. [1]

    Joachim Arts, Robert N. Boute, Stijn Loeys, and Heletjé E. van Staden. 2025. Fifty Years of Maintenance Optimization: Reflections and Perspectives. European Journal of Operational Research. doi: 10.1016/j.ejor.2024.07.002

  2. [2]

    Seokgoo Kim, Joo-Ho Choi, and Nam H. Kim. 2021. Challenges and Opportunities of System-Level Prognostics. Sensors. doi: 10.3390/s21227655

  3. [3]

    Ferhat Tamssaouet, Khanh Tp Nguyen, Kamal Medjaher, and Marcos Eduardo Orchard. 2023. System-Level Failure Prognostics: Literature Review and Main Challenges. Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability. doi: 10.1177/1748006X221118448

  4. [4]

    [SW] Uraz Odyurt, Ömer Sayilir, Mariëlle Stoelinga, and Vadim Zaytsev, CPSLint version v0.5.0, Oct. 2025. doi: 10.5281/zenodo.17406796

  5. [5]

    Marjan Mernik, Jan Heering, and Anthony M. Sloane. 2005. When and How to Develop Domain-Specific Languages. ACM Computing Surveys. doi: 10.1145/1118890.1118892

  6. [6]

    Uraz Odyurt, Andy D. Pimentel, and Ignacio Gonzalez Alonso. 2022. Improving the robustness of industrial Cyber–Physical Systems through machine learning-based performance anomaly identification. Journal of Systems Architecture. doi: 10.1016/j.sysarc.2022.102716

  7. [7]

    Uraz Odyurt, Julius Roeder, Andy D. Pimentel, Ignacio Gonzalez Alonso, and Cees de Laat. 2021. Power Passports for Fault Tolerance: Anomaly Detection in Industrial CPS Using Electrical EFB. In 2021 4th IEEE International Conference on Industrial Cyber-Physical Systems (ICPS). doi: 10.1109/ICPS49255.2021.9468262

  8. [8]

    Vadim Zaytsev. 2017. Language Design with Intent. In 2017 ACM/IEEE 20th International Conference on Model Driven Engineering Languages and Systems (MoDELS). doi: 10.1109/MODELS.2017.16

  9. [9]

    Paul Klint, Tijs van der Storm, and Jurgen Vinju. 2009. RASCAL: A Domain Specific Language for Source Code Analysis and Manipulation. In 2009 Ninth IEEE International Working Conference on Source Code Analysis and Manipulation. doi: 10.1109/SCAM.2009.28

  10. [10]

    Paul Klint, Tijs van der Storm, and Jurgen Vinju. 2011. EASY Meta-programming with Rascal. In Generative and Transformational Techniques in Software Engineering III: International Summer School, GTTSE 2009, Braga, Portugal, July 6-11,

  11. [11]

    Springer

    Revised Papers. Springer. doi: 10.1007/978-3-642-18023-1_6

  12. [12]

    Federico Tomassetti and Vadim Zaytsev. 2020. Reflections on the Lack of Adoption of Domain Specific Languages. In STAF Workshop Proceedings (OOPSLE). http://ceur-ws.org/Vol-2707/oopslepaper5.pdf

  13. [13]

    GNU Project. 2025. GNU datamash. Retrieved Aug. 26, 2025 from https://www.gnu.org/software/datamash/

  14. [14]

    Anders Hoff. 2024. Lisp Query Notation — A DSL for Data Processing. (2024). doi: 10.5281/zenodo.11001584

  15. [15]

    Joan Giner-Miguelez, Abel Gómez, and Jordi Cabot. 2023. A Domain-Specific Language for Describing Machine Learning Datasets. Journal of Computer Languages. doi: 10.1016/j.cola.2023.101209

  16. [16]

    Felix Heine, Carsten Kleiner, and Thomas Oelsner. 2020. A DSL for Automated Data Quality Monitoring. In Database and Expert Systems Applications. doi: 10.1007/978-3-030-59003-1_6

  17. [17]

    Alfonso de la Vega, Diego García-Saiz, Marta Zorrilla, and Pablo Sánchez. 2020. Lavoisier: A DSL for Increasing the Level of Abstraction of Data Selection and Formatting in Data Mining. Journal of Computer Languages. doi: 10.1016/j.cola.2020.100987

  18. [18]

    Brian Sal, Diego García-Saiz, Alfonso de la Vega, and Pablo Sánchez. 2024. Domain-Specific Languages for the Automated Generation of Datasets for Industry 4.0 Applications. Journal of Industrial Information Integration. doi: 10.1016/j.jii.2024.100657

  19. [19]

    Stefan Ackermann, Vojin Jovanovic, Tiark Rompf, and Martin Odersky. 2012. Jet: An Embedded DSL for High Performance Big Data Processing. (2012). https://infoscience.epfl.ch/handle/20.500.14299/85985

  20. [20]

    Birgit Vogel-Heuser et al. 2025. DSL4DPiFS — A Graphical Notation to Model Data Pipeline Deployment in Forming Systems. at - Automatisierungstechnik. doi: 10.1515/auto-2024-0114.

Table 1: Comparison of supported functionality amongst existing tools/DSLs and with CPSLint in the last row. The c...