3W Dataset 2.0.0: a realistic and public dataset with rare undesirable real events in oil wells
Pith reviewed 2026-05-19 07:07 UTC · model grok-4.3
The pith
The 3W Dataset has been updated to version 2.0.0 with structural modifications and additional labeled data for detecting rare oil well events.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The 3W Dataset 2.0.0 is a publicly available set of expert-labeled multivariate time series that records both normal oil well operations and rare undesirable events, with structural updates and extra labeled instances added to support more effective machine learning for early detection.
What carries the argument
The 3W Dataset, a collection of multivariate time series labeled by experts to mark undesirable events in oil wells.
If this is right
- Machine learning models can now be trained on a larger set of real rare-event examples to improve detection accuracy.
- New detection methodologies can be tested and compared against prior results using the updated data.
- Digital monitoring products can be built that give operators more advance warning before events cause damage.
- Corrective or mitigating actions become feasible earlier in the event sequence.
Where Pith is reading between the lines
- The expanded set of rare events could serve as a benchmark for testing anomaly detection methods across other industrial sensor streams.
- Structural changes may simplify the process of feeding the data into online learning or streaming analytics systems.
- Users could combine this dataset with synthetic generators to study how class imbalance affects model performance.
- Cross-domain transfer from this oil-well data to similar time-series problems in other sectors becomes more practical.
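The class-imbalance point above can be made concrete with a minimal sketch. It assumes each instance carries a single integer class label; the label codes, the toy counts, and the helper names `inverse_frequency_weights` and `imbalance_ratio` are hypothetical and do not come from the dataset or its tooling.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return {class: weight}, where weight = n_samples / (n_classes * class_count).

    Rare classes receive larger weights, a common way to offset imbalance
    when training a classifier.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def imbalance_ratio(labels):
    """Ratio between the most frequent and the least frequent class counts."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Toy labels invented for illustration only (not real dataset statistics):
# class 0 stands for normal operation, classes 1 and 2 for two rare event types.
toy_labels = [0] * 90 + [1] * 8 + [2] * 2

weights = inverse_frequency_weights(toy_labels)
ratio = imbalance_ratio(toy_labels)

for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.3f}")
print(f"imbalance ratio: {ratio:.1f}")
```

Weights of this kind could be passed to a loss function or a sampler; whether that helps on the real data is an empirical question the sketch does not answer.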
Load-bearing premise
Expert labels accurately and consistently identify the rare undesirable events across the added time series data.
What would settle it
A side-by-side re-labeling of the newly added time series by independent experts that finds frequent disagreements on event timing or type would show the labels cannot be trusted for model training.
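One way to quantify such a re-labeling comparison is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is an illustration under stated assumptions: it assumes both labelings assign one class per observation over the same time steps, and the label sequences are invented. It is not a procedure described in the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: fraction of positions where both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two raters' marginal class frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical class everywhere; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented per-observation labels: 0 = normal, 1 = event.
# The second labeling places the same event one step later.
original_labels = [0, 0, 0, 1, 1, 1, 0, 0, 0, 0]
independent_labels = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0]

kappa = cohens_kappa(original_labels, independent_labels)
print(f"kappa = {kappa:.3f}")
```

A kappa near 1 would indicate consistent labels, while low values on the newly added series would point to the kind of disagreement on timing or type described above.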
Original abstract
In the oil industry, undesirable events in oil wells can cause economic losses, environmental accidents, and human casualties. Solutions based on Artificial Intelligence and Machine Learning for Early Detection of such events have proven valuable for diverse applications across industries. In 2019, recognizing the importance and the lack of public datasets related to undesirable events in oil wells, Petrobras developed and publicly released the first version of the 3W Dataset, which is essentially a set of Multivariate Time Series labeled by experts. Since then, the 3W Dataset has been developed collaboratively and has become a foundational reference for numerous works in the field. This data article describes the current publicly available version of the 3W Dataset, which contains structural modifications and additional labeled data. The detailed description provided encourages and supports the 3W community and new 3W users to improve previous published results and to develop new robust methodologies, digital products and services capable of detecting undesirable events in oil wells with enough anticipation to enable corrective or mitigating actions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper describes the 3W Dataset 2.0.0, an updated public collection of multivariate time series from oil wells that have been labeled by experts for rare undesirable events. It documents structural modifications relative to the 2019 release and the addition of new labeled instances, with the goal of supporting AI/ML research on early detection to prevent economic losses and safety incidents.
Significance. If the modifications and labels are faithfully documented, the release is significant because it supplies a realistic, publicly available benchmark containing rare but high-impact events. Such data are scarce in the industrial ML literature and directly enable reproducible work on anomaly detection methods that could be deployed to reduce environmental and operational risks in oil production.
major comments (1)
- [Dataset construction and labeling section] The description of the expert labeling process for the newly added time series is insufficiently detailed. Because the central claim of the paper is that the dataset now contains additional reliable labels for rare real events, the manuscript must specify the labeling protocol, inter-rater consistency measures, and any validation steps used for the new instances. This information is load-bearing for users who will treat the labels as ground truth.
minor comments (2)
- [Abstract] The abstract states that structural modifications have been made but does not enumerate them; adding a single sentence listing the main changes (e.g., new sensor channels, revised sampling rates, or file-format updates) would improve immediate clarity.
- [Results or Dataset Statistics] A summary table showing the number of new time series, their total duration, and the distribution of event classes in version 2.0.0 versus 1.0 would help readers quickly gauge the scale of the update.
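A table of the kind requested in the second minor comment could be produced along the following lines. This is a sketch only: it assumes one class label per instance in each release, and the label lists and the function names `class_distribution_table` and `format_table` are hypothetical placeholders, not actual counts or tooling from either version.

```python
from collections import Counter


def class_distribution_table(labels_old, labels_new):
    """Return rows (class, old_count, new_count, added) for every class seen."""
    old, new = Counter(labels_old), Counter(labels_new)
    rows = []
    for cls in sorted(set(old) | set(new)):
        # Counter returns 0 for classes absent from one release.
        rows.append((cls, old[cls], new[cls], new[cls] - old[cls]))
    return rows


def format_table(rows):
    """Render the rows as a fixed-width plain-text table."""
    lines = [f"{'class':>5} {'old':>6} {'new':>6} {'added':>6}"]
    for cls, old_count, new_count, added in rows:
        lines.append(f"{cls:>5} {old_count:>6} {new_count:>6} {added:>6}")
    return "\n".join(lines)


# Placeholder labels invented for illustration; not real dataset counts.
toy_old = [0] * 5 + [1] * 2
toy_new = [0] * 7 + [1] * 3 + [2] * 1

rows = class_distribution_table(toy_old, toy_new)
print(format_table(rows))
```

Total duration per class could be added as a further column if instance lengths are available.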
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and the recommendation for minor revision. We agree that greater detail on the labeling process will improve the manuscript and address the comment below by expanding the relevant section in the revised version.
Point-by-point responses
- Referee: [Dataset construction and labeling section] The description of the expert labeling process for the newly added time series is insufficiently detailed. Because the central claim of the paper is that the dataset now contains additional reliable labels for rare real events, the manuscript must specify the labeling protocol, inter-rater consistency measures, and any validation steps used for the new instances. This information is load-bearing for users who will treat the labels as ground truth.
  Authors: We agree that the current description of the expert labeling process for the newly added instances is insufficiently detailed. In the revised manuscript we will expand the Dataset construction and labeling section to include a full account of the labeling protocol (including the sequence of steps followed by domain experts at Petrobras), any inter-rater consistency measures that were applied, and the validation procedures used to confirm the new labels. These additions will allow readers to evaluate the reliability of the labels as ground truth.
  Revision: yes
Circularity Check
No significant circularity; purely descriptive data release
The manuscript is a data article that documents structural changes and newly added expert-labeled time series in the public 3W Dataset 2.0.0. It contains no derivations, equations, predictions, parameter fittings, or model-based claims that could reduce to self-definition or fitted inputs. The central claim is enumerative and descriptive, supported by direct description of the dataset contents rather than any internal mathematical construction or self-citation chain. Expert labeling is acknowledged as a domain limitation but does not create a circular dependency within the paper's own logic.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "The 3W Dataset is a collection of Multivariate Time Series (MTS) labeled by experts... instances derived from 3 different sources (real, simulated, and hand-drawn)"
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean, LogicNat recovery (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "All instances were generated with observations recorded every 1 second... fixed sampling frequency of 1 Hz"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.