Batch Distillation Data for Developing Machine Learning Anomaly Detection Methods

Aparna Muraleedharan; Fabian Jirasek; Hans Hasse; Heike Leitte; Indra Jungjohann; Jakob Burger; Justus Arweiler; Kerstin Münnemann

arxiv: 2510.18075 · v2 · submitted 2025-10-20 · 💻 cs.LG

Justus Arweiler, Indra Jungjohann, Aparna Muraleedharan, Heike Leitte, Jakob Burger, Kerstin Münnemann, Fabian Jirasek, Hans Hasse

Pith reviewed 2026-05-18 05:33 UTC · model grok-4.3

classification 💻 cs.LG

keywords batch distillation · anomaly detection · machine learning · experimental database · chemical processes · fault diagnosis · time series data · open data

The pith

A new open database of laboratory batch distillation experiments with induced anomalies and expert labels supports training of machine learning anomaly detection methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors operated a laboratory-scale batch distillation plant to collect time-series data from 119 experiments across varied conditions and mixtures. Most anomaly-containing runs were paired with matching fault-free ones, and the data include sensor readings, actuator states, online NMR concentration profiles, video, and audio, plus measurement uncertainties. Expert annotations based on a developed ontology identify the causes of each anomaly. The full structured database is released publicly to overcome the scarcity of open experimental data needed for machine learning in chemical process monitoring.

Core claim

By conducting and documenting 119 controlled batch distillation runs that include both normal operation and deliberately introduced anomalies, together with rich sensor data, unconventional measurement streams, and cause-specific annotations, the work supplies a public resource that directly enables supervised and interpretable machine learning for anomaly detection and mitigation in chemical engineering.

What carries the argument

The custom anomaly ontology together with expert annotations that label the specific causes of each induced fault in the time-series records.

If this is right

Supervised machine learning models can be trained on the labeled time-series data to detect anomalies in similar processes.
Explainable AI techniques can use the cause annotations to identify which physical faults trigger specific sensor deviations.
Methods for anomaly mitigation can be tested by replaying the recorded sequences and evaluating corrective actions.
Multimodal detection algorithms can incorporate the provided NMR, video, and audio streams alongside conventional sensors.
Standardized benchmarking of different anomaly detection algorithms becomes feasible on a shared experimental set.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar annotated experimental collections could be created for other unit operations such as continuous distillation or reaction systems.
The database could support transfer-learning studies that adapt models from laboratory to pilot-scale equipment.
Process control engineers might use the labeled fault data to design automated recovery sequences that activate once an anomaly is confirmed.
Long-term use of the data might reveal recurring early-warning patterns that allow preventive maintenance before faults fully develop.

Load-bearing premise

The laboratory-scale experiments with intentionally induced anomalies produce sensor signatures and anomaly types that match those occurring in real industrial batch distillation processes.

What would settle it

Side-by-side comparison of anomaly types and sensor patterns recorded from an operating industrial batch distillation column against the laboratory database; large mismatches in frequency, duration, or signal characteristics would show the data are not representative.
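
One way such a comparison could be quantified, assuming industrial records were available, is a two-sample Kolmogorov-Smirnov statistic on a summary quantity such as anomaly duration. The sketch below is a generic stdlib implementation; the duration values are invented placeholders, and the choice of the KS statistic is an assumption rather than anything proposed in the paper.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The supremum of the CDF difference is attained at one of the sample points.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical anomaly durations in seconds (placeholders, not measured values).
lab_durations = [30, 45, 50, 60, 75]
plant_durations = [40, 55, 300, 420, 600]

print(ks_statistic(lab_durations, plant_durations))
```

A value near 0 would indicate similar duration distributions, while a value near 1 would indicate the kind of large mismatch described above.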

Original abstract

Machine learning (ML) holds great potential to advance anomaly detection (AD) in chemical processes. However, the development of ML-based methods is hindered by the lack of openly available experimental data. To address this gap, we have set up a laboratory-scale batch distillation plant and operated it to generate an extensive experimental database, covering fault-free experiments and experiments in which anomalies were intentionally induced, for training advanced ML-based AD methods. In total, 119 experiments were conducted across a wide range of operating conditions and mixtures. Most experiments containing anomalies were paired with a corresponding fault-free one. The database that we provide here includes time-series data from numerous sensors and actuators, along with estimates of measurement uncertainty. In addition, unconventional data sources -- such as concentration profiles obtained via online benchtop NMR spectroscopy and video and audio recordings -- are provided. Extensive metadata and expert annotations of all experiments are included. The anomaly annotations are based on an ontology developed in this work. The data are organized in a structured database and made freely available via doi.org/10.5281/zenodo.17395543. This new database paves the way for the development of advanced ML-based AD methods. As it includes information on the causes of anomalies, it further enables the development of interpretable and explainable ML approaches, as well as methods for anomaly mitigation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This is a straightforward data-release paper with a new multimodal lab dataset for anomaly detection in batch distillation, but the induced faults may not reflect real industrial conditions.

The main thing to know is that the authors ran 119 experiments on a single laboratory-scale batch distillation plant, paired most anomalous runs with fault-free ones, and released the full sensor time series plus online NMR profiles, video, audio, uncertainty estimates, and expert annotations based on their own anomaly ontology. The data sit on Zenodo with a clear structure and metadata. That fills a documented gap in open experimental records for ML work in chemical process monitoring. The pairing and the unconventional modalities are practical additions that could support multimodal or explainable models. The experimental campaign looks methodical on the surface, with operating conditions varied across mixtures and runs.

The soft spot is the leap from lab to plant. All anomalies were deliberately induced on one small setup, and the paper gives no direct comparison of time-series signatures, durations, or causal mechanisms against documented industrial faults. Scale differences in heat and mass transfer, sensor placement, and coupled disturbances are left unaddressed, so the practical relevance for real-world anomaly detection or mitigation methods stays unverified.

This paper is for ML researchers and process engineers who need benchmark data to train and test detectors. It deserves a serious referee because the contribution is concrete, the data are new and usable, and the field has few comparable open resources, even if later users will have to test transferability themselves.

Referee Report

1 major / 3 minor

Summary. The manuscript reports the generation and public release of an experimental database from 119 runs on a single laboratory-scale batch distillation plant. It includes paired fault-free and intentionally induced anomaly experiments across varied operating conditions and mixtures, with time-series sensor/actuator data, measurement uncertainty estimates, online benchtop NMR concentration profiles, video and audio recordings, extensive metadata, and expert annotations grounded in a custom anomaly ontology. The data are structured and deposited at Zenodo (doi.org/10.5281/zenodo.17395543). The central claim is that this resource addresses the lack of open experimental data and thereby enables development of advanced, interpretable ML-based anomaly detection and mitigation methods for chemical processes.

Significance. If the data are of high quality and the annotations reliable, the release fills a documented gap in publicly available, richly annotated process data for ML research in chemical engineering. Strengths include the structured campaign design with uncertainty quantification, paired runs, inclusion of unconventional modalities (NMR, video/audio), and the developed anomaly ontology that links observations to causes. These features directly support training of explainable models and mitigation strategies. However, the practical significance for industrial chemical processes remains conditional on the degree to which the laboratory-scale, intentionally induced anomalies reproduce real-world fault signatures, frequencies, and causal mechanisms.

major comments (1)

[Abstract and §2 (Methods)] The claim that the database 'paves the way for the development of advanced ML-based AD methods' for chemical processes and enables 'interpretable and explainable ML approaches, as well as methods for anomaly mitigation' rests on the unverified assumption that the laboratory-scale induced anomalies produce representative sensor signatures, durations, and causal mechanisms. No comparison is provided to documented industrial batch distillation faults (differences in scale, heat/mass transfer, sensor placement, and coupled disturbances are expected). This is load-bearing for the stated significance and should be addressed either by adding a dedicated limitations/comparison subsection or by tempering the applicability claims.

minor comments (3)

[§3 (Data organization)] Clarify the exact file-naming convention and directory structure for the 119 experiments so that users can unambiguously map metadata, annotations, and raw time-series files without additional documentation.
[Table 1 or equivalent] Provide a summary table listing the number of experiments per anomaly class (from the ontology) and per mixture type to give readers an immediate overview of class balance.
[§4 (Annotations)] Specify the number of independent experts who performed the annotations and any inter-annotator agreement metric; this strengthens the reliability claim for the ontology-based labels.
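
For readers unfamiliar with inter-annotator agreement, Cohen's kappa is one common metric for two annotators: it compares the observed agreement with the agreement expected by chance. The sketch below is a generic stdlib implementation; the label names and the two label sequences are invented placeholders, not annotations from the database, and the referee does not prescribe this particular metric.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Fraction of items on which both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels from two annotators for eight experiments.
annotator_1 = ["leak", "leak", "blockage", "normal", "normal", "normal", "sensor_drift", "leak"]
annotator_2 = ["leak", "blockage", "blockage", "normal", "normal", "leak", "sensor_drift", "leak"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.652
```

With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the analogous choice.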

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their constructive and detailed review. We have addressed the major comment by adding a dedicated limitations subsection and by tempering the applicability claims in the abstract and introduction, while maintaining the core contribution of the open experimental database.

Referee: [Abstract and §2 (Methods)] The claim that the database 'paves the way for the development of advanced ML-based AD methods' for chemical processes and enables 'interpretable and explainable ML approaches, as well as methods for anomaly mitigation' rests on the unverified assumption that the laboratory-scale induced anomalies produce representative sensor signatures, durations, and causal mechanisms. No comparison is provided to documented industrial batch distillation faults (differences in scale, heat/mass transfer, sensor placement, and coupled disturbances are expected). This is load-bearing for the stated significance and should be addressed either by adding a dedicated limitations/comparison subsection or by tempering the applicability claims.

Authors: We agree that laboratory-scale experiments cannot fully replicate industrial conditions, including differences in scale, heat/mass transfer, sensor placement, and coupled disturbances. The induced anomalies were chosen to represent common fault classes (e.g., leaks, blockages, sensor drift, and actuator failures) under controlled yet varied operating conditions, with expert annotations linked to an anomaly ontology to support explainable and mitigation-focused ML research. Because detailed industrial batch-distillation fault datasets are proprietary and not publicly available, a quantitative side-by-side comparison is not feasible. To directly address the concern, we have inserted a new subsection titled 'Limitations and Scope of Applicability' in the Discussion that explicitly discusses scale effects, the intentional nature of the anomalies, and the expected differences in sensor signatures and disturbance coupling. We have also revised the abstract and the relevant paragraphs in Section 2 to temper the language, replacing stronger claims with statements that the database 'supports the development and benchmarking' of such methods rather than asserting direct industrial representativeness.

Revision: yes

standing simulated objections · not resolved

Direct quantitative comparison of sensor signatures, durations, and causal mechanisms to proprietary industrial batch distillation fault data, which is not openly accessible.

Circularity Check

0 steps flagged

No circularity: experimental data release with no derivations or fitted predictions

The paper contains no mathematical derivations, predictions, or fitted models. Its sole contribution is the direct release of 119 laboratory-scale batch distillation experiments (time-series sensor/actuator data, online NMR concentration profiles, video/audio recordings, measurement uncertainties, metadata, and expert annotations based on a new ontology). The forward-looking statement that the database 'paves the way for the development of advanced ML-based AD methods' is not a derived quantitative result but an assertion of potential utility; it does not reduce to any self-referential equation, self-citation chain, or input-by-construction. All content is independent experimental observation and annotation, satisfying the criteria for a self-contained, non-circular data publication.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central contribution rests on experimental data collection and annotation rather than mathematical axioms or derivations; the main untested premise is that lab-induced anomalies generalize to industrial settings.

axioms (1)

domain assumption: Laboratory-scale batch distillation experiments with intentionally induced anomalies produce sensor signatures representative of real industrial faults.
This assumption underpins the utility of the generated data for training ML models intended for industrial use.

invented entities (1)

Anomaly ontology · no independent evidence
purpose: Systematic classification and annotation of anomaly causes across experiments.
Developed in this work to label the dataset and enable interpretable ML; no independent external validation provided.

pith-pipeline@v0.9.0 · 5800 in / 1327 out tokens · 53816 ms · 2026-05-18T05:33:23.444204+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

UTOPYA: A Multimodal Deep Learning Framework for Physics-Informed Anomaly Detection and Time-Series Prediction
cs.LG · 2026-05 · unverdicted · novelty 5.0

UTOPYA fuses eight modalities via FiLM-conditioned attention and physics-informed regularization to reach AUROC 0.874 for anomaly detection in batch distillation, outperforming baselines by 0.147.
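
For context on the metric quoted above, AUROC can be read as the probability that a randomly chosen anomalous sample receives a higher anomaly score than a randomly chosen normal one, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the scores and labels are invented, and this is a generic definition, not the evaluation code of the cited work.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison (label 1 = anomaly)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Count anomaly/normal pairs ranked correctly; ties contribute one half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical anomaly scores and ground-truth labels.
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
labels = [0, 0, 1, 1, 1, 0]

print(auroc(scores, labels))  # 8 of 9 pairs ranked correctly
```

The pairwise form is quadratic in the number of samples, so for long time series a rank-based implementation would be preferable.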