JunoBench: A Benchmark Dataset of Crashes in Python Machine Learning Jupyter Notebooks

Dániel Varró; José Antonio Hernández López; Ulf Nilsson; Yiran Wang

arxiv: 2510.18013 · v4 · submitted 2025-10-20 · 💻 cs.SE

Pith reviewed 2026-05-18 05:39 UTC · model grok-4.3

classification 💻 cs.SE

keywords Jupyter notebooks, machine learning, crash dataset, benchmark, debugging, Python, reproducibility, Kaggle

The pith

JunoBench supplies 111 reproducible crashes from real ML Jupyter notebooks along with their fixes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces JunoBench as the first dedicated benchmark of crashes that occur in Python machine learning code inside Jupyter notebooks. It curates 111 examples drawn from public Kaggle notebooks, ensures each one reproduces reliably in a single environment, and attaches verified fixes plus labels that describe both the library involved and any notebook-specific execution-order problems. A reader would care because notebooks are the dominant setting for ML prototyping yet most debugging research still targets ordinary scripts, leaving notebook-unique failure modes unaddressed. The dataset therefore supplies concrete material that future tools can use to detect, localize, diagnose, and repair such crashes.

Core claim

JunoBench is a collection of 111 curated, reproducible crashes taken from public Kaggle notebooks that use TensorFlow/Keras, PyTorch, Scikit-learn and similar libraries. Each entry includes a verified fix, labels that classify crash characteristics, and natural-language diagnostic annotations. All crashes are packaged inside a unified environment that guarantees they can be reproduced on demand, covering both ordinary library errors and the out-of-order cell execution faults that are distinctive to notebooks.

What carries the argument

JunoBench, the benchmark dataset of 111 crashes equipped with reproduction scripts, verified fixes, and multi-level annotations.

Load-bearing premise

That the 111 crashes selected from public Kaggle notebooks are representative of typical real-world crashes in ML notebook development, and that the curation process introduces no significant selection bias.

What would settle it

A survey or log analysis of crashes from private or non-Kaggle ML notebooks showing that the majority fall into categories or libraries absent from JunoBench.
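The test described above amounts to a coverage calculation: take crash records from some other source, label each with the library involved, and measure what share falls outside the benchmark. The snippet below is a minimal sketch of that arithmetic, not part of JunoBench. The `COVERED` set lists only the libraries named on this page (the paper's full label set may be larger), and the `sample` records are invented for illustration.

```python
from collections import Counter

# Libraries named in the abstract and the Figure 2 caption. This is an
# assumption about coverage, not the benchmark's complete label set.
COVERED = {
    "tensorflow/keras",
    "pytorch",
    "scikit-learn",
    "statsmodels",
    "torchvision",
    "lightgbm",
}


def uncovered_share(observed_libraries, covered=COVERED):
    """Return the fraction of observed crashes whose library is not covered."""
    counts = Counter(lib.lower() for lib in observed_libraries)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no crash records supplied")
    missing = sum(n for lib, n in counts.items() if lib not in covered)
    return missing / total


# Hypothetical crash log from a non-Kaggle source (labels are made up).
sample = ["PyTorch", "JAX", "Scikit-learn", "XGBoost", "JAX", "TensorFlow/Keras"]

share = uncovered_share(sample)
# The premise would be challenged only if a clear majority were uncovered.
print(f"uncovered share: {share:.2f}, majority uncovered: {share > 0.5}")
```

A real analysis would also need to compare crash categories, not just libraries, and would depend on how consistently the external logs are labeled.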

Figures

Figures reproduced from arXiv: 2510.18013 by Dániel Varró, José Antonio Hernández López, Ulf Nilsson, Yiran Wang.

**Figure 1.** Overview of the benchmark construction process. We design each benchmark instance as an independent notebook containing a single, isolated, and reproducible crash. This ensures that each case is self-contained, simplifying evaluation and comparison across automated debugging tools. For each notebook, we maintain three versions: (1) the original notebook as collected, (2) a reproduced version containing minim…

**Figure 2.** Characteristics of JunoBench. Each bar in (b) and (d) is segmented by libraries (a). “TF/K” stands for “TensorFlow/Keras”. “Minor libs” include Statsmodels (2), TorchVision (1), and LightGBM (1). This distribution highlights JunoBench’s diverse coverage of challenges in ML notebook development, spanning DL, classical ML, data processing, and visualization libraries, as well as execution order issues unique …

Original abstract

Jupyter notebooks are widely used for machine learning (ML) prototyping. Yet, few debugging tools are designed for ML code in notebooks, partly, due to the lack of benchmarks. We introduce JunoBench, the first benchmark dataset of real-world crashes in Python-based ML notebooks. JunoBench includes 111 curated and reproducible crashes with verified fixes from public Kaggle notebooks, covering popular ML libraries (e.g., TensorFlow/Keras, PyTorch, Scikit-learn) and notebook-specific out-of-order execution errors. JunoBench ensures reproducibility and ease of use through a unified environment that reliably reproduces all crashes. By providing realistic crashes, their resolutions, richly annotated labels of crash characteristics, and natural-language diagnostic annotations, JunoBench facilitates research on bug detection, localization, diagnosis, and repair in notebook-based ML development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

JunoBench supplies 111 reproducible crashes from Kaggle ML notebooks with fixes and labels, but the selection rules stay too vague to judge how typical they are.

JunoBench puts together 111 crashes drawn from public Kaggle notebooks, each with a verified fix, a unified reproduction environment, and labels for crash type plus natural-language notes. That is the main new piece: a focused collection aimed at notebook ML work rather than general code or non-notebook ML bugs. The authors also call out notebook-specific issues such as out-of-order cell execution, which most existing benchmarks ignore. The reproducibility setup and the annotations look like practical help for anyone building detection or repair tools in this setting. Credit is due for shipping a ready-to-use dataset instead of just describing one more idea.

The curation step is the clear weak point. The abstract states that the crashes are curated and reproducible, yet it gives no search strategy, inclusion rules, total candidate count, or checks against the broader distribution of notebook failures. Kaggle notebooks lean toward competition code that is often cleaned up, so any filter that favors popular libraries or easy-to-reproduce errors could leave out the messier state and dependency problems that show up in ordinary prototyping. That gap makes it hard to treat the set as representative without more evidence.

The paper is aimed at researchers who work on debugging and repair for machine-learning notebooks. A reader who needs a concrete test set for tool evaluation will find something usable here, even before the authors add more selection details. It is worth sending to peer review so referees can press on the methodology and see whether the verification process holds up under closer scrutiny.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces JunoBench, the first benchmark dataset of real-world crashes in Python-based ML Jupyter notebooks. It consists of 111 curated and reproducible crashes with verified fixes sourced from public Kaggle notebooks, covering popular ML libraries (TensorFlow/Keras, PyTorch, Scikit-learn) and notebook-specific issues such as out-of-order execution errors. The dataset includes rich annotations for crash characteristics and natural-language diagnostic notes, supported by a unified environment to ensure reproducibility, with the aim of facilitating research on bug detection, localization, diagnosis, and repair.

Significance. If the curation and verification processes are shown to be transparent and bias-controlled, JunoBench would fill a notable gap by supplying the first dedicated, reproducible benchmark for ML notebook crashes. Strengths include the unified reproduction environment, verified fixes, and multi-faceted annotations, which directly support downstream work on notebook-specific debugging tools.

major comments (2)

[Dataset Construction] The central claim that JunoBench supplies a realistic, representative sample of real-world ML notebook crashes rests on the curation process. The manuscript provides insufficient detail on selection methodology, including the search strategy over the Kaggle corpus, inclusion/exclusion rules, size of the initial candidate pool, and any quantitative checks against broader distributions of notebook failures (e.g., §3 or Dataset Construction section). Without these, potential selection biases—such as favoring competition-oriented or easily reproducible errors—cannot be evaluated.
[Reproducibility and Verification] The verification process for both crashes and fixes is described at a high level (reproducibility via unified environment and verified fixes) but lacks concrete steps, such as how out-of-order execution errors were confirmed or how fixes were validated across environments. This is load-bearing for the reproducibility claim (Abstract and §4).

minor comments (2)

Add summary statistics (e.g., distribution across libraries, crash categories, and notebook lengths) in a table or figure to characterize the 111 instances more quantitatively.
Clarify the annotation schema with explicit definitions and one or two concrete examples for each label type (crash characteristics, diagnostic notes).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and positive assessment of JunoBench's potential to address a gap in notebook debugging research. We address each major comment below with clarifications and revisions to improve transparency.

Referee: [Dataset Construction] The central claim that JunoBench supplies a realistic, representative sample of real-world ML notebook crashes rests on the curation process. The manuscript provides insufficient detail on selection methodology, including the search strategy over the Kaggle corpus, inclusion/exclusion rules, size of the initial candidate pool, and any quantitative checks against broader distributions of notebook failures (e.g., §3 or Dataset Construction section). Without these, potential selection biases—such as favoring competition-oriented or easily reproducible errors—cannot be evaluated.

Authors: We agree that greater detail on the curation process is necessary to support claims of representativeness and to allow assessment of biases. The original manuscript summarized the process at a high level in Section 3. In the revised version, we have expanded this section to explicitly describe the search strategy (Kaggle API queries using tags for TensorFlow, PyTorch, Scikit-learn and error-related terms), inclusion/exclusion rules (Python 3 notebooks, public availability, presence of a reproducible crash with ML library involvement, exclusion of non-crash or non-ML examples), the initial candidate pool size screened, and quantitative comparisons of crash type distributions against prior notebook bug studies. We also added discussion of bias mitigation steps, such as sampling across competition and non-competition notebooks. revision: yes
Referee: [Reproducibility and Verification] The verification process for both crashes and fixes is described at a high level (reproducibility via unified environment and verified fixes) but lacks concrete steps, such as how out-of-order execution errors were confirmed or how fixes were validated across environments. This is load-bearing for the reproducibility claim (Abstract and §4).

Authors: We acknowledge that the verification steps were presented at a high level and that concrete details are important for the reproducibility claim. In the revised Section 4, we have added explicit protocols: out-of-order execution errors were confirmed by re-executing cells in the original notebook order versus a permuted order within the unified Docker environment and verifying the crash manifests only in the out-of-order case; fixes were validated by applying the patch, re-running the notebook to confirm resolution, and cross-checking in a second independent environment. We have included a step-by-step verification checklist and examples for each category. revision: yes
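The verification protocol in the simulated response above can be illustrated with a toy harness. This is a minimal sketch under stated assumptions, not the authors' tooling: cells are plain source strings run with `exec` in one shared namespace to imitate a kernel, no Docker environment or second environment is involved, and the cell contents, the `run_cells` helper, and both example notebooks are hypothetical.

```python
from typing import Optional


def run_cells(cells: dict, order: list) -> Optional[str]:
    """Run cell sources in the given order inside one shared namespace.

    Returns the exception type name of the first crash, or None if every
    cell completes. Sharing one namespace mimics notebook kernel state.
    """
    namespace: dict = {}
    for cell_id in order:
        try:
            exec(cells[cell_id], namespace)
        except Exception as exc:
            return type(exc).__name__
    return None


# Hypothetical notebook whose recorded execution ran c3 before c2.
cells = {
    "c1": "data = [3.0, 1.0, 2.0]",
    "c2": "scaled = [x / max(data) for x in data]",
    "c3": "total = sum(scaled)",
}
linear_order = ["c1", "c2", "c3"]
permuted_order = ["c1", "c3", "c2"]

# Out-of-order check: the crash must appear only in the permuted run.
crash_linear = run_cells(cells, linear_order)
crash_permuted = run_cells(cells, permuted_order)
is_out_of_order_crash = crash_linear is None and crash_permuted is not None

# Fix validation for an ordinary crash: patch one cell, then re-run.
buggy = {"c1": "values = ['1', '2']", "c2": "total = sum(values)"}
fixed = dict(buggy, c2="total = sum(int(v) for v in values)")
crash_before_fix = run_cells(buggy, ["c1", "c2"])
crash_after_fix = run_cells(fixed, ["c1", "c2"])

print(crash_linear, crash_permuted, is_out_of_order_crash)
print(crash_before_fix, crash_after_fix)
```

A real harness would execute actual `.ipynb` files inside the pinned environment and repeat the run in an independent one. The toy version only shows the comparison logic: same cells, two orders, and a before/after check on the patched notebook.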

Circularity Check

0 steps flagged

No circularity: dataset paper with no derivations, equations, or self-referential predictions.

This paper introduces JunoBench as a curated collection of 111 reproducible crashes from public Kaggle notebooks, with annotations for crash characteristics and fixes. It contains no mathematical derivations, equations, fitted parameters, or predictive claims that could reduce to inputs by construction. The contribution is the creation of a benchmark artifact itself, which is self-contained and externally verifiable through the provided notebooks and unified reproduction environment. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core claims. Concerns about selection bias or representativeness relate to dataset validity and external benchmarking, not to any circular reduction in a derivation process.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the representativeness of Kaggle-sourced crashes and the validity of the curation process; no free parameters or invented entities are introduced.

axioms (2)

domain assumption: Crashes collected from public Kaggle notebooks represent real-world ML notebook issues.
Dataset is built exclusively from public Kaggle notebooks as stated in the abstract.
ad hoc to paper: Curated selection of 111 crashes with verified fixes is comprehensive and unbiased.
Abstract claims curation and verification but does not detail selection rules.
