
arxiv: 2510.18013 · v4 · submitted 2025-10-20 · 💻 cs.SE

JunoBench: A Benchmark Dataset of Crashes in Python Machine Learning Jupyter Notebooks

Pith reviewed 2026-05-18 05:39 UTC · model grok-4.3

classification 💻 cs.SE
keywords Jupyter notebooks · machine learning · crash dataset · benchmark · debugging · Python · reproducibility · Kaggle

The pith

JunoBench supplies 111 reproducible crashes from real ML Jupyter notebooks along with their fixes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces JunoBench as the first dedicated benchmark of crashes that occur in Python machine learning code inside Jupyter notebooks. It curates 111 examples drawn from public Kaggle notebooks, ensures each one reproduces reliably in a single environment, and attaches verified fixes plus labels that describe both the library involved and any notebook-specific execution-order problems. A reader would care because notebooks are the dominant setting for ML prototyping yet most debugging research still targets ordinary scripts, leaving notebook-unique failure modes unaddressed. The dataset therefore supplies concrete material that future tools can use to detect, localize, diagnose, and repair such crashes.

Core claim

JunoBench is a collection of 111 curated, reproducible crashes taken from public Kaggle notebooks that use TensorFlow/Keras, PyTorch, Scikit-learn and similar libraries. Each entry includes a verified fix, labels that classify crash characteristics, and natural-language diagnostic annotations. All crashes are packaged inside a unified environment that guarantees they can be reproduced on demand, covering both ordinary library errors and the out-of-order cell execution faults that are distinctive to notebooks.
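To make the notebook-specific failure mode concrete, the following is a minimal sketch of an out-of-order execution crash. It models notebook cells as strings executed against one shared namespace, which is a simplifying assumption; the cell contents and the `run_cells` helper are hypothetical and are not taken from JunoBench.

```python
# Hypothetical notebook cells; in a real notebook each would be a separate code cell.
cells = [
    "data = [1.0, 2.0, 3.0]",               # cell 0: load data
    "mean = sum(data) / len(data)",          # cell 1: depends on cell 0
    "centered = [x - mean for x in data]",   # cell 2: depends on cells 0 and 1
]


def run_cells(cells, order):
    """Execute cells in the given order against one shared namespace.

    Returns (namespace, None) on success, or (namespace, exception) at the
    first crash, mimicking a kernel that stops at the failing cell.
    """
    namespace = {}
    for index in order:
        try:
            exec(cells[index], namespace)
        except Exception as error:  # a notebook would display the traceback here
            return namespace, error
    return namespace, None


# Top-to-bottom execution succeeds.
_, linear_error = run_cells(cells, [0, 1, 2])

# Running cell 2 before cell 1 crashes because `mean` is not defined yet.
_, shuffled_error = run_cells(cells, [0, 2, 1])
print(type(shuffled_error).__name__, shuffled_error)
```

The same code is correct as a script read top to bottom, which is why this class of crash does not appear in benchmarks built from ordinary Python programs.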

What carries the argument

JunoBench, the benchmark dataset of 111 crashes equipped with reproduction scripts, verified fixes, and multi-level annotations.

Load-bearing premise

That the 111 crashes selected from public Kaggle notebooks are representative of typical real-world crashes in ML notebook development, and that the curation process introduces no significant selection bias.

What would settle it

A survey or log analysis of crashes from private or non-Kaggle ML notebooks showing that the majority fall into categories or libraries absent from JunoBench.
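One way such a check could be operationalized, sketched under stated assumptions: given crash labels from an external sample, compute the share whose label never occurs in the benchmark. The function name `uncovered_share` and all labels below are invented placeholders, not JunoBench statistics.

```python
from collections import Counter


def uncovered_share(benchmark_labels, external_labels):
    """Fraction of external crashes whose label never occurs in the benchmark."""
    if not external_labels:
        raise ValueError("external_labels must not be empty")
    covered = set(benchmark_labels)
    counts = Counter(external_labels)
    uncovered = sum(n for label, n in counts.items() if label not in covered)
    return uncovered / len(external_labels)


# Placeholder labels for illustration only.
benchmark = ["tensorflow", "pytorch", "sklearn", "execution-order"]
external = ["pytorch", "jax", "jax", "sklearn", "xgboost"]

share = uncovered_share(benchmark, external)
print(f"{share:.0%} of external crashes fall outside the benchmark's labels")
```

A share above one half in a real external sample would be the kind of evidence described above; the toy numbers here carry no such weight.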

Figures

Figures reproduced from arXiv: 2510.18013 by Dániel Varró, José Antonio Hernández López, Ulf Nilsson, Yiran Wang.

Figure 1: Overview of the benchmark construction process.
… design each benchmark instance as an independent notebook containing a single, isolated, and reproducible crash. This ensures that each case is self-contained, simplifying evaluation and comparison across automated debugging tools. For each notebook, we maintain three versions: (1) the original notebook as collected, (2) a reproduced version containing minim…

Figure 2: Characteristics of JunoBench. Each bar in (b) and (d) is segmented by libraries (a). “TF/K” stands for “TensorFlow/Keras”. “Minor libs” include Statsmodels (2), TorchVision (1), and LightGBM (1).
This distribution highlights JunoBench’s diverse coverage of challenges in ML notebook development, spanning DL, classical ML, data processing, and visualization libraries, as well as execution order issues unique …
Original abstract

Jupyter notebooks are widely used for machine learning (ML) prototyping. Yet, few debugging tools are designed for ML code in notebooks, partly, due to the lack of benchmarks. We introduce JunoBench, the first benchmark dataset of real-world crashes in Python-based ML notebooks. JunoBench includes 111 curated and reproducible crashes with verified fixes from public Kaggle notebooks, covering popular ML libraries (e.g., TensorFlow/Keras, PyTorch, Scikit-learn) and notebook-specific out-of-order execution errors. JunoBench ensures reproducibility and ease of use through a unified environment that reliably reproduces all crashes. By providing realistic crashes, their resolutions, richly annotated labels of crash characteristics, and natural-language diagnostic annotations, JunoBench facilitates research on bug detection, localization, diagnosis, and repair in notebook-based ML development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces JunoBench, the first benchmark dataset of real-world crashes in Python-based ML Jupyter notebooks. It consists of 111 curated and reproducible crashes with verified fixes sourced from public Kaggle notebooks, covering popular ML libraries (TensorFlow/Keras, PyTorch, Scikit-learn) and notebook-specific issues such as out-of-order execution errors. The dataset includes rich annotations for crash characteristics and natural-language diagnostic notes, supported by a unified environment to ensure reproducibility, with the aim of facilitating research on bug detection, localization, diagnosis, and repair.

Significance. If the curation and verification processes are shown to be transparent and bias-controlled, JunoBench would fill a notable gap by supplying the first dedicated, reproducible benchmark for ML notebook crashes. Strengths include the unified reproduction environment, verified fixes, and multi-faceted annotations, which directly support downstream work on notebook-specific debugging tools.

major comments (2)
  1. [Dataset Construction] The central claim that JunoBench supplies a realistic, representative sample of real-world ML notebook crashes rests on the curation process. The manuscript provides insufficient detail on selection methodology, including the search strategy over the Kaggle corpus, inclusion/exclusion rules, size of the initial candidate pool, and any quantitative checks against broader distributions of notebook failures (e.g., §3 or Dataset Construction section). Without these, potential selection biases—such as favoring competition-oriented or easily reproducible errors—cannot be evaluated.
  2. [Reproducibility and Verification] The verification process for both crashes and fixes is described at a high level (reproducibility via unified environment and verified fixes) but lacks concrete steps, such as how out-of-order execution errors were confirmed or how fixes were validated across environments. This is load-bearing for the reproducibility claim (Abstract and §4).
minor comments (2)
  1. Add summary statistics (e.g., distribution across libraries, crash categories, and notebook lengths) in a table or figure to characterize the 111 instances more quantitatively.
  2. Clarify the annotation schema with explicit definitions and one or two concrete examples for each label type (crash characteristics, diagnostic notes).
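To illustrate what the two minor comments ask for, here is a hedged sketch of an annotation record and the summary statistics that could be derived from it. The `CrashRecord` fields, the `summarize` helper, and the three sample records are all hypothetical; the paper's actual schema and counts may differ.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class CrashRecord:
    """Hypothetical annotation record; field names are illustrative assumptions."""
    notebook_id: str
    library: str          # e.g. "TensorFlow/Keras", "PyTorch", "Scikit-learn"
    exception_type: str   # e.g. "ValueError"
    out_of_order: bool    # True if the crash needs a non-linear cell order
    diagnosis: str        # natural-language diagnostic note


def summarize(records):
    """Tally records by library, by exception type, and by execution-order flag."""
    return {
        "by_library": Counter(r.library for r in records),
        "by_exception": Counter(r.exception_type for r in records),
        "out_of_order": sum(r.out_of_order for r in records),
    }


# Invented sample records, for illustration only.
records = [
    CrashRecord("nb-001", "PyTorch", "RuntimeError", False,
                "Tensor shapes do not match the layer's expected input."),
    CrashRecord("nb-002", "Scikit-learn", "ValueError", False,
                "Training data contains NaN values."),
    CrashRecord("nb-003", "PyTorch", "NameError", True,
                "Model cell was run before the cell that defines it."),
]

summary = summarize(records)
for library, count in summary["by_library"].most_common():
    print(f"{library}: {count}")
print("out-of-order crashes:", summary["out_of_order"])
```

A table built from tallies like these, over the real 111 instances, would answer the first minor comment; explicit field definitions like those in the record would answer the second.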

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of JunoBench's potential to address a gap in notebook debugging research. We address each major comment below with clarifications and revisions to improve transparency.

Point-by-point responses
  1. Referee: [Dataset Construction] The central claim that JunoBench supplies a realistic, representative sample of real-world ML notebook crashes rests on the curation process. The manuscript provides insufficient detail on selection methodology, including the search strategy over the Kaggle corpus, inclusion/exclusion rules, size of the initial candidate pool, and any quantitative checks against broader distributions of notebook failures (e.g., §3 or Dataset Construction section). Without these, potential selection biases—such as favoring competition-oriented or easily reproducible errors—cannot be evaluated.

    Authors: We agree that greater detail on the curation process is necessary to support claims of representativeness and to allow assessment of biases. The original manuscript summarized the process at a high level in Section 3. In the revised version, we have expanded this section to explicitly describe the search strategy (Kaggle API queries using tags for TensorFlow, PyTorch, Scikit-learn and error-related terms), inclusion/exclusion rules (Python 3 notebooks, public availability, presence of a reproducible crash with ML library involvement, exclusion of non-crash or non-ML examples), the initial candidate pool size screened, and quantitative comparisons of crash type distributions against prior notebook bug studies. We also added discussion of bias mitigation steps, such as sampling across competition and non-competition notebooks. revision: yes

  2. Referee: [Reproducibility and Verification] The verification process for both crashes and fixes is described at a high level (reproducibility via unified environment and verified fixes) but lacks concrete steps, such as how out-of-order execution errors were confirmed or how fixes were validated across environments. This is load-bearing for the reproducibility claim (Abstract and §4).

    Authors: We acknowledge that the verification steps were presented at a high level and that concrete details are important for the reproducibility claim. In the revised Section 4, we have added explicit protocols: out-of-order execution errors were confirmed by re-executing cells in the original notebook order versus a permuted order within the unified Docker environment and verifying the crash manifests only in the out-of-order case; fixes were validated by applying the patch, re-running the notebook to confirm resolution, and cross-checking in a second independent environment. We have included a step-by-step verification checklist and examples for each category. revision: yes
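The verification protocol described in the second response can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the authors' checklist: cells are plain strings run in a fresh in-process namespace rather than in a Docker environment, the second independent environment is not modeled, and every function and cell below is hypothetical.

```python
def first_crash(cells, order):
    """Run cells in `order` in a fresh shared namespace; return the first exception or None."""
    namespace = {}
    for index in order:
        try:
            exec(cells[index], namespace)
        except Exception as error:
            return error
    return None


def crashes_only_out_of_order(cells, linear_order, observed_order):
    """True if the linear order runs cleanly but the observed order crashes."""
    return (first_crash(cells, linear_order) is None
            and first_crash(cells, observed_order) is not None)


def fix_resolves_crash(fixed_cells, observed_order):
    """True if the patched cells survive the order that used to crash."""
    return first_crash(fixed_cells, observed_order) is None


# Hypothetical buggy notebook: cell 1 overwrites `x`, so re-running it fails.
buggy = ["x = [1, 2, 3]", "x = len(x)"]
# Hypothetical fix: store the length under a new name instead.
fixed = ["x = [1, 2, 3]", "n = len(x)"]

linear_order = [0, 1]
observed_order = [0, 1, 1]  # the user re-ran cell 1

confirmed = crashes_only_out_of_order(buggy, linear_order, observed_order)
resolved = fix_resolves_crash(fixed, observed_order)
print(confirmed, resolved)
```

Starting each run from a fresh namespace matters here: leftover state from a previous run could mask or create a crash, which is the same reason a real protocol would restart the kernel between runs.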

Circularity Check

0 steps flagged

No circularity: dataset paper with no derivations, equations, or self-referential predictions.

Full rationale

This paper introduces JunoBench as a curated collection of 111 reproducible crashes from public Kaggle notebooks, with annotations for crash characteristics and fixes. It contains no mathematical derivations, equations, fitted parameters, or predictive claims that could reduce to inputs by construction. The contribution is the creation of a benchmark artifact itself, which is self-contained and externally verifiable through the provided notebooks and unified reproduction environment. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core claims. Concerns about selection bias or representativeness relate to dataset validity and external benchmarking, not to any circular reduction in a derivation process.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the representativeness of Kaggle-sourced crashes and the validity of the curation process; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption: Crashes collected from public Kaggle notebooks represent real-world ML notebook issues.
    The dataset is built exclusively from public Kaggle notebooks, as stated in the abstract.
  • ad hoc to paper: The curated selection of 111 crashes with verified fixes is comprehensive and unbiased.
    The abstract claims curation and verification but does not detail selection rules.

pith-pipeline@v0.9.0 · 5682 in / 1361 out tokens · 40617 ms · 2026-05-18T05:39:59.838800+00:00 · methodology

