Where did we fail? -- Reproducing build failures in embedded open source software
Pith reviewed 2026-05-07 09:08 UTC · model grok-4.3
The pith
PhantomRun reconstructs 91.8% of failing CI builds for embedded software using only logs and metadata.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PhantomRun supplies a uniform abstraction that retrieves the build log for any commit and re-executes the corresponding build while exposing all artifacts and metadata in consistent form. On 4628 failing CI runs it reconstructed 91.8% of the builds and preserved the original execution outcome in 98% of the cases examined, with mismatches confined to timestamps or small nondeterministic reordering.
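The two capabilities named here — retrieving the archived log for a commit and re-executing the corresponding build — can be sketched as a minimal interface. The class and method names below are hypothetical illustrations, not PhantomRun's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class BuildRecord:
    """Uniform record for one CI run: log text plus normalized metadata."""
    commit: str
    log: str
    metadata: dict = field(default_factory=dict)

class BuildArchive:
    """Hypothetical PhantomRun-style store keyed by commit hash."""
    def __init__(self):
        self._records = {}

    def store(self, record: BuildRecord) -> None:
        self._records[record.commit] = record

    def retrieve(self, commit: str) -> BuildRecord:
        """Return the archived build log and metadata for a commit."""
        return self._records[commit]

    def reproduce(self, commit: str, runner) -> str:
        """Re-execute the build for a commit via a caller-supplied runner."""
        record = self.retrieve(commit)
        return runner(record)

# Toy runner: a real one would restore the toolchain and invoke the build.
archive = BuildArchive()
archive.store(BuildRecord("abc123", "gcc: error: board.h not found",
                          {"toolchain": "arm-none-eabi-gcc 10.3"}))
outcome = archive.reproduce(
    "abc123", lambda r: "failure" if "error" in r.log else "success")
```

Outcome preservation, in these terms, would mean the reproduced `outcome` matches the one recorded in the original run.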
What carries the argument
PhantomRun, the unified abstraction layer that standardizes retrieval, storage, and reproduction of heterogeneous CI build logs and metadata from varied runners and toolchains.
If this is right
- Enables large-scale longitudinal studies of CI failures that were previously impossible due to short-lived logs.
- Makes all build artifacts and metadata available in uniform machine-readable format for automated analysis.
- Shows that historical CI reconstruction is feasible at scale for embedded projects with hardware-software constraints.
- Reproduced builds match originals closely enough that only minor nondeterministic differences appear.
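A "uniform machine-readable format" for build metadata might look like the following round-trippable record. The field names are invented for illustration, not the dataset's actual schema:

```python
import json

# Hypothetical normalized record for one failing CI run.
record = {
    "commit": "abc123",
    "runner": "github-actions",
    "toolchain": "arm-none-eabi-gcc 10.3",
    "board": "stm32f4-discovery",
    "outcome": "failure",
    "log_path": "logs/abc123.txt",
}

# A stable serialization lets downstream tools diff and query runs uniformly.
serialized = json.dumps(record, sort_keys=True)
restored = json.loads(serialized)
```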
Where Pith is reading between the lines
- The same log-to-reproduction pipeline could be applied to non-embedded projects where CI environments also vary across runners.
- Automated tools could use the reproduced builds to isolate root causes of recurring failures without needing the original hardware.
- Projects might adopt PhantomRun-style logging standards to make future failure analysis cheaper.
- Long-term use could reveal whether certain toolchain or board configurations systematically produce more failures.
Load-bearing premise
The information present in CI logs and metadata is sufficient to recreate the original build environments, toolchains, and hardware constraints, without any external details beyond what the logs record.
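This premise holds only if environment details are actually recoverable from log text. A minimal sketch of that extraction step follows; the log format and regex patterns are invented for illustration, not PhantomRun's parsers:

```python
import re

# Invented example log; real CI logs vary widely across runners.
LOG = """\
Runner: ubuntu-22.04
Toolchain: arm-none-eabi-gcc (GNU Arm Embedded) 10.3.1
Board: nrf52840dk
make: *** [Makefile:42: firmware.elf] Error 1
"""

def extract_environment(log: str) -> dict:
    """Pull environment facts out of free-form log text via regexes."""
    patterns = {
        "runner": r"Runner:\s*(\S+)",
        "toolchain": r"Toolchain:\s*(.+)",
        "board": r"Board:\s*(\S+)",
    }
    env = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, log)
        if match:
            env[key] = match.group(1).strip()
    return env

env = extract_environment(LOG)
```

Any environment fact the log never printed (e.g. an implicit system library version) is unrecoverable by construction, which is exactly where the premise could fail.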
What would settle it
A reproduced build that produces a new failure mode or different outcome that cannot be explained by timestamps or known nondeterministic effects.
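Applying that criterion requires a comparison that first strips the known-benign differences (timestamps, line reordering) so that anything left over counts as unexplained. A sketch, with the timestamp format assumed for illustration:

```python
import re

# Assumed ISO-like timestamp format; real logs may need more patterns.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?")

def normalize(log: str) -> list:
    """Mask timestamps and sort lines so benign reordering cancels out."""
    lines = [TIMESTAMP.sub("<ts>", line) for line in log.splitlines()]
    return sorted(lines)

def unexplained_difference(original: str, reproduced: str) -> bool:
    """True if the logs differ beyond timestamps and line reordering."""
    return normalize(original) != normalize(reproduced)

orig = "2024-11-10 12:00:01 linking firmware.elf\n2024-11-10 12:00:02 Error 1"
repro = "2025-01-05 09:30:11 Error 1\n2025-01-05 09:30:10 linking firmware.elf"
benign = unexplained_difference(orig, repro)          # reordered + retimed only
divergent = unexplained_difference(orig, repro + "\nsegfault in ld")
```

A `True` result on a reproduced build would be exactly the settling evidence described above: a difference the nondeterminism story cannot absorb.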
Original abstract
Due to hardware-software co-development in embedded systems, continuous integration (CI) builds frequently fail because of complex cross-compilation, board configurations, and toolchain constraints. Although CI build logs contain valuable diagnostic information, they are short-lived and difficult to reuse due to heterogeneous runners, toolchains, and log formats. To address these challenges, we present PhantomRun, a unified abstraction layer and publicly reusable dataset that standardizes the retrieval, storage, and reproduction of CI build logs and metadata. Across 4628 failing CI runs, we reconstructed 91.8% of builds and preserved execution outcomes in 98% of evaluated cases. PhantomRun provides two core capabilities: retrieving the build log of any commit and faithfully re-executing the corresponding build in a controlled environment. By exposing all build artifacts and metadata in a uniform, machine-readable format, PhantomRun enables reproducible and longitudinal studies of CI failures. An empirical evaluation shows that reproduced builds closely match their originals, typically differing only in timestamps or minor nondeterministic reordering, demonstrating the feasibility of large-scale historical CI reconstruction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PhantomRun, a unified abstraction layer and publicly reusable dataset for retrieving, storing, and reproducing CI build logs and metadata from failing builds in embedded open source software. It reports that across 4628 failing CI runs, 91.8% of builds were reconstructed and execution outcomes were preserved in 98% of evaluated cases, with reproduced builds matching originals except for timestamps and minor nondeterminism.
Significance. If the empirical results hold, this provides a valuable public resource and tool for enabling reproducible, longitudinal studies of CI failures in embedded systems, where hardware-software co-development and heterogeneous toolchains make failures particularly hard to diagnose and replicate. The direct testing of reproduction fidelity via the tool and dataset is a strength that supports broader adoption for debugging and research in continuous integration.
major comments (1)
- [Abstract] The central claims of 91.8% reconstruction success across 4628 runs and 98% outcome preservation are presented without details on run selection criteria, error handling during reproduction, or potential confounds (e.g., hardware constraints or toolchain variations). These omissions make it difficult to assess whether the figures are robust or generalizable, which is load-bearing for the feasibility demonstration.
minor comments (2)
- [Abstract] The abstract and evaluation could more explicitly define 'reconstructed' and 'preserved execution outcomes' to avoid ambiguity in interpreting the success metrics.
- [Evaluation] Consider adding a table or figure summarizing the distribution of failure types or embedded platforms in the 4628 runs to strengthen the empirical presentation.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of PhantomRun and the recommendation for minor revision. We address the single major comment below.
Point-by-point responses
- Referee: [Abstract] The central claims of 91.8% reconstruction success across 4628 runs and 98% outcome preservation are presented without details on run selection criteria, error handling during reproduction, or potential confounds (e.g., hardware constraints or toolchain variations). These omissions make it difficult to assess whether the figures are robust or generalizable, which is load-bearing for the feasibility demonstration.
Authors: We agree that the abstract's brevity omits key methodological details that would aid assessment of robustness. The manuscript addresses these points in full: run selection criteria are specified in Section 3.1 (dataset construction from public CI logs of selected embedded OSS repositories); error handling is described in Section 4.2 (including logging of reconstruction failures due to missing dependencies and automated environment setup); and potential confounds such as hardware constraints and toolchain variations are analyzed in Section 5.3 (with results showing outcome preservation holds under containerized and emulated setups). To improve the abstract, we will revise it to briefly note the dataset scope and evaluation controls. This change will make the central claims easier to evaluate without altering the reported figures.
Revision: yes
Circularity Check
No significant circularity detected in empirical claims
Full rationale
The paper reports empirical success rates for PhantomRun on 4628 external CI runs (91.8% reconstruction, 98% outcome preservation), with outcomes measured by direct comparison to original builds. These metrics are not derived from self-referential definitions, fitted parameters, or equations that reduce to inputs by construction. No uniqueness theorems, ansatzes, or self-citation chains are invoked as load-bearing for the central feasibility demonstration. The results rest on external CI data and tool execution, remaining self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: CI build logs and metadata contain sufficient information to faithfully reproduce the original build environment and outcome.
invented entities (1)
- PhantomRun (no independent evidence)