JunoBench: A Benchmark Dataset of Crashes in Python Machine Learning Jupyter Notebooks
Pith reviewed 2026-05-18 05:39 UTC · model grok-4.3
The pith
JunoBench supplies 111 reproducible crashes from real ML Jupyter notebooks along with their fixes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
JunoBench is a collection of 111 curated, reproducible crashes taken from public Kaggle notebooks that use TensorFlow/Keras, PyTorch, Scikit-learn and similar libraries. Each entry includes a verified fix, labels that classify crash characteristics, and natural-language diagnostic annotations. All crashes are packaged inside a unified environment that guarantees they can be reproduced on demand, covering both ordinary library errors and the out-of-order cell execution faults that are distinctive to notebooks.
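To make the notebook-specific fault class concrete, here is a minimal sketch of an out-of-order execution crash. It models the notebook kernel as one shared namespace and runs the same two cells in two different orders. The `run_cells` helper and the cell contents are illustrative assumptions and are not taken from JunoBench.

```python
def run_cells(cells, order):
    """Execute cell sources in the given order inside one shared namespace.

    Returns the name of the first exception raised, or None if every cell ran.
    """
    namespace = {}  # stands in for the kernel state shared by all cells
    for index in order:
        try:
            exec(cells[index], namespace)
        except Exception as error:
            return type(error).__name__
    return None


# Hypothetical two-cell notebook: cell 1 depends on a name defined in cell 0.
cells = [
    "scores = [0.9, 0.8, 0.7]",
    "mean_score = sum(scores) / len(scores)",
]

in_order = run_cells(cells, [0, 1])      # top-to-bottom execution
out_of_order = run_cells(cells, [1, 0])  # dependent cell run first
print(in_order, out_of_order)
```

The same source runs cleanly top to bottom and fails only when the dependent cell is executed first, which is why such faults cannot be found by reading cells in their displayed order.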
What carries the argument
JunoBench, the benchmark dataset of 111 crashes equipped with reproduction scripts, verified fixes, and multi-level annotations.
Load-bearing premise
That the 111 crashes selected from public Kaggle notebooks are representative of typical real-world crashes in ML notebook development, and that the curation process introduces no significant selection bias.
What would settle it
A survey or log analysis of crashes from private or non-Kaggle ML notebooks showing that the majority fall into categories or libraries absent from JunoBench.
Original abstract
Jupyter notebooks are widely used for machine learning (ML) prototyping. Yet, few debugging tools are designed for ML code in notebooks, partly due to the lack of benchmarks. We introduce JunoBench, the first benchmark dataset of real-world crashes in Python-based ML notebooks. JunoBench includes 111 curated and reproducible crashes with verified fixes from public Kaggle notebooks, covering popular ML libraries (e.g., TensorFlow/Keras, PyTorch, Scikit-learn) and notebook-specific out-of-order execution errors. JunoBench ensures reproducibility and ease of use through a unified environment that reliably reproduces all crashes. By providing realistic crashes, their resolutions, richly annotated labels of crash characteristics, and natural-language diagnostic annotations, JunoBench facilitates research on bug detection, localization, diagnosis, and repair in notebook-based ML development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces JunoBench, the first benchmark dataset of real-world crashes in Python-based ML Jupyter notebooks. It consists of 111 curated and reproducible crashes with verified fixes sourced from public Kaggle notebooks, covering popular ML libraries (TensorFlow/Keras, PyTorch, Scikit-learn) and notebook-specific issues such as out-of-order execution errors. The dataset includes rich annotations for crash characteristics and natural-language diagnostic notes, supported by a unified environment to ensure reproducibility, with the aim of facilitating research on bug detection, localization, diagnosis, and repair.
Significance. If the curation and verification processes are shown to be transparent and bias-controlled, JunoBench would fill a notable gap by supplying the first dedicated, reproducible benchmark for ML notebook crashes. Strengths include the unified reproduction environment, verified fixes, and multi-faceted annotations, which directly support downstream work on notebook-specific debugging tools.
major comments (2)
- [Dataset Construction] The central claim that JunoBench supplies a realistic, representative sample of real-world ML notebook crashes rests on the curation process. The manuscript provides insufficient detail on selection methodology (e.g., in §3 or the Dataset Construction section), including the search strategy over the Kaggle corpus, inclusion/exclusion rules, size of the initial candidate pool, and any quantitative checks against broader distributions of notebook failures. Without these, potential selection biases, such as favoring competition-oriented or easily reproducible errors, cannot be evaluated.
- [Reproducibility and Verification] The verification process for both crashes and fixes is described at a high level (reproducibility via unified environment and verified fixes) but lacks concrete steps, such as how out-of-order execution errors were confirmed or how fixes were validated across environments. This is load-bearing for the reproducibility claim (Abstract and §4).
minor comments (2)
- Add summary statistics (e.g., distribution across libraries, crash categories, and notebook lengths) in a table or figure to characterize the 111 instances more quantitatively.
- Clarify the annotation schema with explicit definitions and one or two concrete examples for each label type (crash characteristics, diagnostic notes).
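As a rough illustration of what the two minor comments ask for, the sketch below defines a possible shape for one annotated entry and tallies a few entries by library and exception type. The field names (`notebook_id`, `library`, `exception_type`, `out_of_order`, `diagnosis`) and the three records are invented placeholders; they are neither JunoBench's actual schema nor its data.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CrashEntry:
    """Hypothetical annotation record for one crash (not the real schema)."""
    notebook_id: str
    library: str
    exception_type: str
    out_of_order: bool
    diagnosis: str  # natural-language diagnostic note


# Placeholder records for illustration only; not drawn from the dataset.
entries = [
    CrashEntry("nb-001", "pytorch", "RuntimeError", False,
               "Tensor shapes do not match at the loss computation."),
    CrashEntry("nb-002", "scikit-learn", "ValueError", False,
               "Input contains NaN values before fitting."),
    CrashEntry("nb-003", "pytorch", "NameError", True,
               "Model cell was run before the cell that defines it."),
]


def summarize(entries):
    """Count entries per library, per exception type, and out-of-order cases."""
    return {
        "by_library": Counter(e.library for e in entries),
        "by_exception": Counter(e.exception_type for e in entries),
        "out_of_order": sum(e.out_of_order for e in entries),
    }


summary = summarize(entries)
print(summary)
```

A table built from counts like these, computed over all 111 instances, would give readers the quantitative characterization the referee requests.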
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and positive assessment of JunoBench's potential to address a gap in notebook debugging research. We address each major comment below with clarifications and revisions to improve transparency.
Point-by-point responses
- Referee: [Dataset Construction] The central claim that JunoBench supplies a realistic, representative sample of real-world ML notebook crashes rests on the curation process. The manuscript provides insufficient detail on selection methodology (e.g., in §3 or the Dataset Construction section), including the search strategy over the Kaggle corpus, inclusion/exclusion rules, size of the initial candidate pool, and any quantitative checks against broader distributions of notebook failures. Without these, potential selection biases, such as favoring competition-oriented or easily reproducible errors, cannot be evaluated.
  Authors: We agree that greater detail on the curation process is necessary to support claims of representativeness and to allow assessment of biases. The original manuscript summarized the process at a high level in Section 3. In the revised version, we have expanded this section to explicitly describe the search strategy (Kaggle API queries using tags for TensorFlow, PyTorch, Scikit-learn and error-related terms), the inclusion/exclusion rules (Python 3 notebooks, public availability, presence of a reproducible crash with ML library involvement, exclusion of non-crash or non-ML examples), the size of the initial candidate pool that was screened, and quantitative comparisons of crash type distributions against prior notebook bug studies. We also added a discussion of bias mitigation steps, such as sampling across competition and non-competition notebooks.
  revision: yes
- Referee: [Reproducibility and Verification] The verification process for both crashes and fixes is described at a high level (reproducibility via unified environment and verified fixes) but lacks concrete steps, such as how out-of-order execution errors were confirmed or how fixes were validated across environments. This is load-bearing for the reproducibility claim (Abstract and §4).
  Authors: We acknowledge that the verification steps were presented at a high level and that concrete details are important for the reproducibility claim. In the revised Section 4, we have added explicit protocols. Out-of-order execution errors were confirmed by re-executing cells in the original notebook order and in a permuted order within the unified Docker environment, and verifying that the crash manifests only in the out-of-order case. Fixes were validated by applying the patch, re-running the notebook to confirm resolution, and cross-checking in a second independent environment. We have included a step-by-step verification checklist and examples for each category.
  revision: yes
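The protocol described in the second response can be sketched as a small checklist function. This is only an illustration under stated assumptions: cells are modeled as source strings run in one shared namespace, the helper names (`first_exception`, `verify_entry`) and the three-cell example are hypothetical, and the cross-check in a second independent environment is not modeled.

```python
def first_exception(cells, order):
    """Run cells in the given order in a fresh shared namespace.

    Returns the first exception's type name, or None if all cells succeed.
    """
    namespace = {}
    for index in order:
        try:
            exec(cells[index], namespace)
        except Exception as error:
            return type(error).__name__
    return None


def verify_entry(buggy_cells, fixed_cells, original_order, crash_order):
    """Hypothetical checklist mirroring the three checks in the rebuttal."""
    return {
        # 1. The notebook runs cleanly in its original cell order.
        "clean_in_original_order":
            first_exception(buggy_cells, original_order) is None,
        # 2. The crash appears when cells run in the permuted order.
        "crashes_in_permuted_order":
            first_exception(buggy_cells, crash_order) is not None,
        # 3. After the patch, the permuted order no longer crashes.
        "fix_resolves_crash":
            first_exception(fixed_cells, crash_order) is None,
    }


# Illustrative notebook: cell 2 uses `ordered`, which cell 1 defines.
buggy_cells = ["raw = [3, 1, 2]", "ordered = sorted(raw)", "top = ordered[-1]"]
# Assumed fix: cell 2 recomputes the value it depends on.
fixed_cells = ["raw = [3, 1, 2]", "ordered = sorted(raw)",
               "ordered = sorted(raw)\ntop = ordered[-1]"]

report = verify_entry(buggy_cells, fixed_cells,
                      original_order=[0, 1, 2], crash_order=[0, 2, 1])
print(report)
```

An entry would pass only if all three flags are true. Checks 1 and 2 together establish that the crash is caused by execution order rather than by the code itself.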
Circularity Check
No circularity: dataset paper with no derivations, equations, or self-referential predictions.
full rationale
This paper introduces JunoBench as a curated collection of 111 reproducible crashes from public Kaggle notebooks, with annotations for crash characteristics and fixes. It contains no mathematical derivations, equations, fitted parameters, or predictive claims that could reduce to inputs by construction. The contribution is the creation of a benchmark artifact itself, which is self-contained and externally verifiable through the provided notebooks and unified reproduction environment. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core claims. Concerns about selection bias or representativeness relate to dataset validity and external benchmarking, not to any circular reduction in a derivation process.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Crashes collected from public Kaggle notebooks represent real-world ML notebook issues
- ad hoc to paper: Curated selection of 111 crashes with verified fixes is comprehensive and unbiased