Where did we fail? -- Reproducing build failures in embedded open source software
Pith reviewed 2026-05-07 09:08 UTC · model grok-4.3
The pith
PhantomRun reconstructs 91.8% of failing CI builds for embedded software using only logs and metadata.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PhantomRun supplies a uniform abstraction that retrieves the build log for any commit and re-executes the corresponding build while exposing all artifacts and metadata in consistent form. On 4628 failing CI runs it reconstructed 91.8% of the builds and preserved the original execution outcome in 98% of the cases examined, with mismatches confined to timestamps or small nondeterministic reordering.
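The two capabilities named here — retrieving the archived log for a commit and re-executing the corresponding build — can be sketched as a minimal interface. The class and method names below are hypothetical illustrations, not PhantomRun's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class BuildRecord:
    """Uniform record for one CI run: log text plus normalized metadata."""
    commit: str
    log: str
    metadata: dict = field(default_factory=dict)

class BuildArchive:
    """Hypothetical PhantomRun-style store keyed by commit hash."""
    def __init__(self):
        self._records = {}

    def store(self, record: BuildRecord) -> None:
        self._records[record.commit] = record

    def retrieve(self, commit: str) -> BuildRecord:
        """Return the archived build log and metadata for a commit."""
        return self._records[commit]

    def reproduce(self, commit: str, runner) -> str:
        """Re-execute the build for a commit via a caller-supplied runner."""
        record = self.retrieve(commit)
        return runner(record)

# Toy runner: a real one would restore the toolchain and invoke the build.
archive = BuildArchive()
archive.store(BuildRecord("abc123", "gcc: error: board.h not found",
                          {"toolchain": "arm-none-eabi-gcc 10.3"}))
outcome = archive.reproduce(
    "abc123", lambda r: "failure" if "error" in r.log else "success")
```

Outcome preservation, in these terms, would mean the reproduced `outcome` matches the one recorded in the original run.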
What carries the argument
PhantomRun, the unified abstraction layer that standardizes retrieval, storage, and reproduction of heterogeneous CI build logs and metadata from varied runners and toolchains.
If this is right
- Enables large-scale longitudinal studies of CI failures that were previously impossible due to short-lived logs.
- Makes all build artifacts and metadata available in uniform machine-readable format for automated analysis.
- Shows that historical CI reconstruction is feasible at scale for embedded projects with hardware-software constraints.
- Reproduced builds match originals closely enough that only minor nondeterministic differences appear.
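A "uniform machine-readable format" for build metadata might look like the following round-trippable record. The field names are invented for illustration, not the dataset's actual schema:

```python
import json

# Hypothetical normalized record for one failing CI run.
record = {
    "commit": "abc123",
    "runner": "github-actions",
    "toolchain": "arm-none-eabi-gcc 10.3",
    "board": "stm32f4-discovery",
    "outcome": "failure",
    "log_path": "logs/abc123.txt",
}

# A stable serialization lets downstream tools diff and query runs uniformly.
serialized = json.dumps(record, sort_keys=True)
restored = json.loads(serialized)
```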
Where Pith is reading between the lines
- The same log-to-reproduction pipeline could be applied to non-embedded projects where CI environments also vary across runners.
- Automated tools could use the reproduced builds to isolate root causes of recurring failures without needing the original hardware.
- Projects might adopt PhantomRun-style logging standards to make future failure analysis cheaper.
- Long-term use could reveal whether certain toolchain or board configurations systematically produce more failures.
Load-bearing premise
The information present in CI logs and metadata is sufficient to recreate the original build environments, toolchains, and hardware constraints, without any external details beyond what the logs record.
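This premise holds only if environment details are actually recoverable from log text. A minimal sketch of that extraction step follows; the log format and regex patterns are invented for illustration, not PhantomRun's parsers:

```python
import re

# Invented example log; real CI logs vary widely across runners.
LOG = """\
Runner: ubuntu-22.04
Toolchain: arm-none-eabi-gcc (GNU Arm Embedded) 10.3.1
Board: nrf52840dk
make: *** [Makefile:42: firmware.elf] Error 1
"""

def extract_environment(log: str) -> dict:
    """Pull environment facts out of free-form log text via regexes."""
    patterns = {
        "runner": r"Runner:\s*(\S+)",
        "toolchain": r"Toolchain:\s*(.+)",
        "board": r"Board:\s*(\S+)",
    }
    env = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, log)
        if match:
            env[key] = match.group(1).strip()
    return env

env = extract_environment(LOG)
```

Any environment fact the log never printed (e.g. an implicit system library version) is unrecoverable by construction, which is exactly where the premise could fail.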
What would settle it
A reproduced build that produces a new failure mode or different outcome that cannot be explained by timestamps or known nondeterministic effects.
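Applying that criterion requires a comparison that first strips the known-benign differences (timestamps, line reordering) so that anything left over counts as unexplained. A sketch, with the timestamp format assumed for illustration:

```python
import re

# Assumed ISO-like timestamp format; real logs may need more patterns.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?")

def normalize(log: str) -> list:
    """Mask timestamps and sort lines so benign reordering cancels out."""
    lines = [TIMESTAMP.sub("<ts>", line) for line in log.splitlines()]
    return sorted(lines)

def unexplained_difference(original: str, reproduced: str) -> bool:
    """True if the logs differ beyond timestamps and line reordering."""
    return normalize(original) != normalize(reproduced)

orig = "2024-11-10 12:00:01 linking firmware.elf\n2024-11-10 12:00:02 Error 1"
repro = "2025-01-05 09:30:11 Error 1\n2025-01-05 09:30:10 linking firmware.elf"
benign = unexplained_difference(orig, repro)          # reordered + retimed only
divergent = unexplained_difference(orig, repro + "\nsegfault in ld")
```

A `True` result on a reproduced build would be exactly the settling evidence described above: a difference the nondeterminism story cannot absorb.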
Original abstract
Due to hardware-software co-development in embedded systems, continuous integration (CI) builds frequently fail because of complex cross-compilation, board configurations, and toolchain constraints. Although CI build logs contain valuable diagnostic information, they are short-lived and difficult to reuse due to heterogeneous runners, toolchains, and log formats. To address these challenges, we present PhantomRun, a unified abstraction layer and publicly reusable dataset that standardizes the retrieval, storage, and reproduction of CI build logs and metadata. Across 4628 failing CI runs, we reconstructed 91.8% of builds and preserved execution outcomes in 98% of evaluated cases. PhantomRun provides two core capabilities: retrieving the build log of any commit and faithfully re-executing the corresponding build in a controlled environment. By exposing all build artifacts and metadata in a uniform, machine-readable format, PhantomRun enables reproducible and longitudinal studies of CI failures. An empirical evaluation shows that reproduced builds closely match their originals, typically differing only in timestamps or minor nondeterministic reordering, demonstrating the feasibility of large-scale historical CI reconstruction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PhantomRun, a unified abstraction layer and publicly reusable dataset for retrieving, storing, and reproducing CI build logs and metadata from failing builds in embedded open source software. It reports that across 4628 failing CI runs, 91.8% of builds were reconstructed and execution outcomes were preserved in 98% of evaluated cases, with reproduced builds matching originals except for timestamps and minor nondeterminism.
Significance. If the empirical results hold, this provides a valuable public resource and tool for enabling reproducible, longitudinal studies of CI failures in embedded systems, where hardware-software co-development and heterogeneous toolchains make failures particularly hard to diagnose and replicate. The direct testing of reproduction fidelity via the tool and dataset is a strength that supports broader adoption for debugging and research in continuous integration.
major comments (1)
- [Abstract] The central claims of 91.8% reconstruction success across 4628 runs and 98% outcome preservation are presented without details on run selection criteria, error handling during reproduction, or potential confounds (e.g., hardware constraints or toolchain variations). These omissions make it difficult to assess whether the figures are robust or generalizable, which is load-bearing for the feasibility demonstration.
minor comments (2)
- [Abstract] The abstract and evaluation could more explicitly define 'reconstructed' and 'preserved execution outcomes' to avoid ambiguity in interpreting the success metrics.
- [Evaluation] Consider adding a table or figure summarizing the distribution of failure types or embedded platforms in the 4628 runs to strengthen the empirical presentation.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of PhantomRun and the recommendation for minor revision. We address the single major comment below.
Point-by-point responses
- Referee: [Abstract] The central claims of 91.8% reconstruction success across 4628 runs and 98% outcome preservation are presented without details on run selection criteria, error handling during reproduction, or potential confounds (e.g., hardware constraints or toolchain variations). These omissions make it difficult to assess whether the figures are robust or generalizable, which is load-bearing for the feasibility demonstration.
Authors: We agree that the abstract's brevity omits key methodological details that would aid assessment of robustness. The manuscript addresses these points in full: run selection criteria are specified in Section 3.1 (dataset construction from public CI logs of selected embedded OSS repositories); error handling is described in Section 4.2 (including logging of reconstruction failures due to missing dependencies and automated environment setup); and potential confounds such as hardware constraints and toolchain variations are analyzed in Section 5.3 (with results showing outcome preservation holds under containerized and emulated setups). To improve the abstract, we will revise it to briefly note the dataset scope and evaluation controls. This change will make the central claims easier to evaluate without altering the reported figures.
Revision: yes
Circularity Check
No significant circularity detected in empirical claims
Full rationale
The paper reports empirical success rates for PhantomRun on 4628 external CI runs (91.8% reconstruction, 98% outcome preservation), with outcomes measured by direct comparison to original builds. These metrics are not derived from self-referential definitions, fitted parameters, or equations that reduce to inputs by construction. No uniqueness theorems, ansatzes, or self-citation chains are invoked as load-bearing for the central feasibility demonstration. The results rest on external CI data and tool execution, remaining self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: CI build logs and metadata contain sufficient information to faithfully reproduce the original build environment and outcome.
invented entities (1)
- PhantomRun (no independent evidence)