Batch Distillation Data for Developing Machine Learning Anomaly Detection Methods
Pith reviewed 2026-05-18 05:33 UTC · model grok-4.3
The pith
A new open database of laboratory batch distillation experiments with induced anomalies and expert labels supports training of machine learning anomaly detection methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The work conducts and documents 119 controlled batch distillation runs that cover both normal operation and deliberately introduced anomalies, and it records sensor time series, unconventional measurement streams, and cause-specific annotations for them. The resulting public resource directly enables supervised and interpretable machine learning for anomaly detection and mitigation in chemical engineering.
What carries the argument
The custom anomaly ontology together with expert annotations that label the specific causes of each induced fault in the time-series records.
If this is right
- Supervised machine learning models can be trained on the labeled time-series data to detect anomalies in similar processes.
- Explainable AI techniques can use the cause annotations to identify which physical faults trigger specific sensor deviations.
- Methods for anomaly mitigation can be tested by replaying the recorded sequences and evaluating corrective actions.
- Multimodal detection algorithms can incorporate the provided NMR, video, and audio streams alongside conventional sensors.
- Standardized benchmarking of different anomaly detection algorithms becomes feasible on a shared experimental set.
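To make the benchmarking point concrete, the following minimal sketch scores one anomalous run against its paired fault-free run and evaluates the score against per-step expert labels with a rank-based AUROC. It is an illustration under stated assumptions, not the paper's method: the function names, the per-step binary label layout, and all numbers are hypothetical, and the runs are assumed to be already aligned in time.

```python
from typing import List, Sequence


def residual_scores(anomalous: Sequence[float], reference: Sequence[float]) -> List[float]:
    """Anomaly score per time step: absolute deviation from the paired fault-free run."""
    if len(anomalous) != len(reference):
        raise ValueError("paired runs must be aligned to the same length")
    return [abs(a - r) for a, r in zip(anomalous, reference)]


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUROC: probability that a random anomalous step
    receives a higher score than a random normal step (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one anomalous and one normal step")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic, illustrative values only (NOT taken from the database).
reference = [78.0, 78.1, 78.2, 78.3, 78.4, 78.5]  # fault-free temperature trace
anomalous = [78.0, 78.2, 78.2, 80.1, 80.6, 78.6]  # paired run with an induced fault
labels = [0, 0, 0, 1, 1, 0]                       # hypothetical expert label per step

scores = residual_scores(anomalous, reference)
print(auroc(scores, labels))
```

On this toy trace the two labeled steps have the largest residuals, so the score separates them perfectly. Any real benchmark would need to fix the alignment, the label granularity, and the metric in advance so that results from different detectors are comparable.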
Where Pith is reading between the lines
- Similar annotated experimental collections could be created for other unit operations such as continuous distillation or reaction systems.
- The database could support transfer-learning studies that adapt models from laboratory to pilot-scale equipment.
- Process control engineers might use the labeled fault data to design automated recovery sequences that activate once an anomaly is confirmed.
- Long-term use of the data might reveal recurring early-warning patterns that allow preventive maintenance before faults fully develop.
Load-bearing premise
The laboratory-scale experiments with intentionally induced anomalies produce sensor signatures and anomaly types that match those occurring in real industrial batch distillation processes.
What would settle it
Side-by-side comparison of anomaly types and sensor patterns recorded from an operating industrial batch distillation column against the laboratory database; large mismatches in frequency, duration, or signal characteristics would show the data are not representative.
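If industrial records were available, one way to quantify a mismatch in a single feature such as anomaly duration would be a two-sample Kolmogorov-Smirnov statistic. The sketch below is only an illustration: the choice of statistic, the choice of duration as the feature, and every number are assumptions, and no industrial data are implied to exist in the database.

```python
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic:
    the largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both samples must be non-empty")

    def ecdf(sample: Sequence[float], x: float) -> float:
        # Fraction of the sample that is less than or equal to x.
        return sum(1 for v in sample if v <= x) / len(sample)

    # The gap can only change at observed values, so checking those is sufficient.
    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)


# Hypothetical anomaly durations in minutes (made-up values for illustration).
lab_durations = [5.0, 6.0, 7.0, 8.0]
plant_durations = [20.0, 25.0, 30.0]

print(ks_statistic(lab_durations, plant_durations))
```

A value near 1 would indicate that the two duration distributions barely overlap, while a value near 0 would indicate close agreement. The same comparison would have to be repeated per anomaly type and per signal characteristic to address the premise as a whole.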
Original abstract
Machine learning (ML) holds great potential to advance anomaly detection (AD) in chemical processes. However, the development of ML-based methods is hindered by the lack of openly available experimental data. To address this gap, we have set up a laboratory-scale batch distillation plant and operated it to generate an extensive experimental database, covering fault-free experiments and experiments in which anomalies were intentionally induced, for training advanced ML-based AD methods. In total, 119 experiments were conducted across a wide range of operating conditions and mixtures. Most experiments containing anomalies were paired with a corresponding fault-free one. The database that we provide here includes time-series data from numerous sensors and actuators, along with estimates of measurement uncertainty. In addition, unconventional data sources -- such as concentration profiles obtained via online benchtop NMR spectroscopy and video and audio recordings -- are provided. Extensive metadata and expert annotations of all experiments are included. The anomaly annotations are based on an ontology developed in this work. The data are organized in a structured database and made freely available via doi.org/10.5281/zenodo.17395543. This new database paves the way for the development of advanced ML-based AD methods. As it includes information on the causes of anomalies, it further enables the development of interpretable and explainable ML approaches, as well as methods for anomaly mitigation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports the generation and public release of an experimental database from 119 runs on a single laboratory-scale batch distillation plant. It includes paired fault-free and intentionally induced anomaly experiments across varied operating conditions and mixtures, with time-series sensor/actuator data, measurement uncertainty estimates, online benchtop NMR concentration profiles, video and audio recordings, extensive metadata, and expert annotations grounded in a custom anomaly ontology. The data are structured and deposited at Zenodo (doi.org/10.5281/zenodo.17395543). The central claim is that this resource addresses the lack of open experimental data and thereby enables development of advanced, interpretable ML-based anomaly detection and mitigation methods for chemical processes.
Significance. If the data are of high quality and the annotations reliable, the release fills a documented gap in publicly available, richly annotated process data for ML research in chemical engineering. Strengths include the structured campaign design with uncertainty quantification, paired runs, inclusion of unconventional modalities (NMR, video/audio), and the developed anomaly ontology that links observations to causes. These features directly support training of explainable models and mitigation strategies. However, the practical significance for industrial chemical processes remains conditional on the degree to which the laboratory-scale, intentionally induced anomalies reproduce real-world fault signatures, frequencies, and causal mechanisms.
major comments (1)
- [Abstract and §2] Abstract and §2 (Methods): The claim that the database 'paves the way for the development of advanced ML-based AD methods' for chemical processes and enables 'interpretable and explainable ML approaches, as well as methods for anomaly mitigation' rests on the unverified assumption that the laboratory-scale induced anomalies produce representative sensor signatures, durations, and causal mechanisms. No comparison is provided to documented industrial batch distillation faults (differences in scale, heat/mass transfer, sensor placement, and coupled disturbances are expected). This is load-bearing for the stated significance and should be addressed either by adding a dedicated limitations/comparison subsection or by tempering the applicability claims.
minor comments (3)
- [§3] §3 (Data organization): Clarify the exact file-naming convention and directory structure for the 119 experiments so that users can unambiguously map metadata, annotations, and raw time-series files without additional documentation.
- [Table 1] Table 1 or equivalent: Provide a summary table listing the number of experiments per anomaly class (from the ontology) and per mixture type to give readers an immediate overview of class balance.
- [§4] §4 (Annotations): Specify the number of independent experts who performed the annotations and any inter-annotator agreement metric; this strengthens the reliability claim for the ontology-based labels.
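For the inter-annotator agreement requested in the last minor comment, Cohen's kappa is one common choice when two annotators label the same items. The sketch below is a minimal illustration, assuming exactly two annotators and one categorical label per experiment; the label names and values are hypothetical and are not drawn from the paper's ontology or annotations.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters: agreement corrected for chance agreement."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels for four experiments (illustrative only).
annotator_1 = ["leak", "leak", "normal", "normal"]
annotator_2 = ["leak", "normal", "normal", "normal"]

print(cohens_kappa(annotator_1, annotator_2))
```

With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the analogous choice; which metric the authors report, if any, is not stated in the material reviewed here.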
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. We have addressed the major comment by adding a dedicated limitations subsection and by tempering the applicability claims in the abstract and introduction, while maintaining the core contribution of the open experimental database.
- Referee: [Abstract and §2] Abstract and §2 (Methods): The claim that the database 'paves the way for the development of advanced ML-based AD methods' for chemical processes and enables 'interpretable and explainable ML approaches, as well as methods for anomaly mitigation' rests on the unverified assumption that the laboratory-scale induced anomalies produce representative sensor signatures, durations, and causal mechanisms. No comparison is provided to documented industrial batch distillation faults (differences in scale, heat/mass transfer, sensor placement, and coupled disturbances are expected). This is load-bearing for the stated significance and should be addressed either by adding a dedicated limitations/comparison subsection or by tempering the applicability claims.
Authors: We agree that laboratory-scale experiments cannot fully replicate industrial conditions, including differences in scale, heat/mass transfer, sensor placement, and coupled disturbances. The induced anomalies were chosen to represent common fault classes (e.g., leaks, blockages, sensor drift, and actuator failures) under controlled yet varied operating conditions, with expert annotations linked to an anomaly ontology to support explainable and mitigation-focused ML research. Because detailed industrial batch-distillation fault datasets are proprietary and not publicly available, a quantitative side-by-side comparison is not feasible. To directly address the concern, we have inserted a new subsection titled 'Limitations and Scope of Applicability' in the Discussion that explicitly discusses scale effects, the intentional nature of the anomalies, and the expected differences in sensor signatures and disturbance coupling. We have also revised the abstract and the relevant paragraphs in Section 2 to temper the language, replacing stronger claims with statements that the database 'supports the development and benchmarking' of such methods rather than asserting direct industrial representativeness.
revision: yes
- Direct quantitative comparison of sensor signatures, durations, and causal mechanisms to proprietary industrial batch distillation fault data, which is not openly accessible.
Circularity Check
No circularity: experimental data release with no derivations or fitted predictions
The paper contains no mathematical derivations, predictions, or fitted models. Its sole contribution is the direct release of 119 laboratory-scale batch distillation experiments (time-series sensor/actuator data, online NMR concentration profiles, video/audio recordings, measurement uncertainties, metadata, and expert annotations based on a new ontology). The forward-looking statement that the database 'paves the way for the development of advanced ML-based AD methods' is not a derived quantitative result but an assertion of potential utility; it does not reduce to any self-referential equation, self-citation chain, or input-by-construction. All content is independent experimental observation and annotation, satisfying the criteria for a self-contained, non-circular data publication.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Laboratory-scale batch distillation experiments with intentionally induced anomalies produce sensor signatures representative of real industrial faults.
invented entities (1)
- Anomaly ontology: no independent evidence
Forward citations
Cited by 1 Pith paper
- UTOPYA: A Multimodal Deep Learning Framework for Physics-Informed Anomaly Detection and Time-Series Prediction
UTOPYA fuses eight modalities via FiLM-conditioned attention and physics-informed regularization to reach AUROC 0.874 for anomaly detection in batch distillation, outperforming baselines by 0.147.