Treating Run-time Execution History as a First-Class Citizen: Co-Versioning Run-time Behavior alongside Code

Marcus Kessel

arxiv: 2604.16933 · v1 · submitted 2026-04-18 · 💻 cs.SE

Pith reviewed 2026-05-10 06:57 UTC · model grok-4.3

classification 💻 cs.SE

keywords behavioral co-versioning · run-time execution history · software evolution · git versioning · regression analysis · execution archiving · test oracles

The pith

Run-time behavior can be co-versioned with source code to support semantic diffing and regression analysis

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Developers routinely version source code with Git but discard rich run-time information from tests, reducing it to simple pass/fail outcomes despite issues like partial oracles, flakiness, and silent drifts. This paper proposes Behavioral Co-Versioning, which couples Git history with a Behavioral Archive that stores selected run-time observations such as method inputs, outputs, and performance signals collected during test runs. The archive is append-only and queryable, keyed by commit and test context, enabling new forms of analysis. A laptop-scale prototype replays historical commits of a Python project and shows detection of behavioral changes not visible in textual diffs.

Core claim

Coupling Git commit history with a queryable Behavioral Archive of run-time observations collected during tests allows semantic diffing, behavior-aware regression localization, and retrospective auditing by querying historical executions, using code/test/behavior fingerprints for diagnostics.
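
One way to read "code/test/behavior fingerprints for diagnostics" is as a three-way comparison between two revisions. The sketch below is an illustration under that assumption, not the paper's diagnostic scheme: the function name `diagnose`, the dictionary keys, and the category labels are all hypothetical.

```python
def diagnose(prev, curr):
    """Classify a revision pair from three fingerprints each: code, test, behavior.

    Keys and returned labels are hypothetical, chosen for illustration only.
    """
    code_changed = prev["code"] != curr["code"]
    test_changed = prev["test"] != curr["test"]
    behavior_changed = prev["behavior"] != curr["behavior"]

    if not behavior_changed:
        if code_changed or test_changed:
            return "behavior-preserving change"
        return "no change"
    if code_changed and test_changed:
        return "behavior change, code and test both changed"
    if code_changed:
        return "behavior change attributable to the code"
    if test_changed:
        return "behavior change attributable to the test"
    # Same code and test, different behavior: nothing textual explains it.
    return "behavior change without code or test change"


before = {"code": "c-01", "test": "t-01", "behavior": "b-01"}
after = {"code": "c-02", "test": "t-01", "behavior": "b-02"}
print(diagnose(before, after))
```

The last category is the interesting one for the issues the paper names: a behavior fingerprint that moves while code and test fingerprints stay fixed points at flakiness or environment drift rather than at a commit.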

What carries the argument

The Behavioral Archive: an append-only, queryable store of selected run-time observations (method I/O and performance signals) collected during test runs and keyed by commit and test context, implemented in the prototype with a Parquet-backed local store.
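
To make the shape of such a store concrete, here is a minimal in-memory sketch. It is a stand-in, not the prototype: the paper's store is Parquet-backed, while this version keeps rows in a Python list so the example stays self-contained. The class names and all field names (`commit`, `test_id`, `method`, `inputs`, `output`, `duration_ms`) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Observation:
    """One recorded run-time observation (field names are hypothetical)."""
    commit: str         # Git commit the test ran against
    test_id: str        # test context, e.g. a test identifier
    method: str         # observed method
    inputs: str         # serialized arguments
    output: str         # serialized return value
    duration_ms: float  # performance signal


class BehavioralArchive:
    """In-memory stand-in for the archive: rows are added, never updated or deleted."""

    def __init__(self) -> None:
        self._rows: List[Observation] = []

    def append(self, obs: Observation) -> None:
        self._rows.append(obs)

    def query(self, commit: Optional[str] = None,
              test_id: Optional[str] = None) -> List[Observation]:
        """Return observations matching the given commit and/or test context."""
        return [
            row for row in self._rows
            if (commit is None or row.commit == commit)
            and (test_id is None or row.test_id == test_id)
        ]


archive = BehavioralArchive()
archive.append(Observation("a1b2c3", "test_parse::iso", "parse",
                           "('2024-01-02',)", "2024-01-02T00:00", 0.8))
archive.append(Observation("d4e5f6", "test_parse::iso", "parse",
                           "('2024-01-02',)", "2024-01-02T00:00+00:00", 0.9))

for row in archive.query(test_id="test_parse::iso"):
    print(row.commit, row.output)
```

Querying by test context across commits, as in the last loop, is the access pattern that retrospective auditing relies on: the same stimulus, viewed along the commit history.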

If this is right

Semantic diffing of behavior across revisions becomes feasible by comparing fingerprints.
Behavior-aware regression localization can pinpoint which code changes introduced behavioral shifts.
Retrospective auditing allows developers to query and inspect past execution states tied to specific commits.
The approach complements signal-specific monitoring tools by adding historical, queryable context from version history.
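
The first two consequences can be sketched in a few lines. This is a simplified illustration, not the paper's algorithm: it assumes a behavior fingerprint is a hash over recorded method I/O (timings are deliberately left out, since they vary between runs), and the helper names `behavior_fingerprint` and `first_behavior_change` are hypothetical.

```python
import hashlib
import json


def behavior_fingerprint(observations):
    """Hash a collection of I/O observations, independent of recording order."""
    canonical = sorted(json.dumps(obs, sort_keys=True) for obs in observations)
    return hashlib.sha256("\n".join(canonical).encode("utf-8")).hexdigest()


def first_behavior_change(history):
    """Return the first commit whose fingerprint differs from its predecessor.

    `history` is a list of (commit, observations) pairs in commit order.
    Returns None if behavior never changes.
    """
    previous = None
    for commit, observations in history:
        current = behavior_fingerprint(observations)
        if previous is not None and current != previous:
            return commit
        previous = current
    return None


# Hypothetical history: the same stimulus yields a different response at c3.
history = [
    ("c1", [{"method": "parse", "inputs": ["1/2/24"], "output": "2024-01-02"}]),
    ("c2", [{"method": "parse", "inputs": ["1/2/24"], "output": "2024-01-02"}]),
    ("c3", [{"method": "parse", "inputs": ["1/2/24"], "output": "2024-02-01"}]),
]
print(first_behavior_change(history))
```

A linear scan is enough at laptop scale; whether a bisection-style search over the archive would be worthwhile is not something the source addresses.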

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Integration into continuous integration pipelines could enable automatic detection of behavioral drifts before merge.
Expanding the stored signals to include additional execution traces might surface subtler or context-dependent changes.
The same co-versioning idea could extend to other version control systems or non-test execution environments.

Load-bearing premise

That collecting and persisting selected run-time observations during test runs imposes acceptable overhead and that the chosen signals like method I/O and performance are sufficient to capture meaningful behavioral changes across revisions.
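
For a sense of what collection involves, the following sketch records method I/O and wall-clock duration with a plain decorator. It is one possible mechanism under stated assumptions, not the prototype's instrumentation: the `observe` decorator and the `RECORDED` list are hypothetical, and a real setup would write to the archive instead of a module-level list.

```python
import functools
import time

RECORDED = []  # hypothetical sink for observations


def observe(func):
    """Record arguments, return value, and duration of each call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)  # exceptions propagate unrecorded
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        RECORDED.append({
            "method": func.__name__,
            "inputs": repr((args, kwargs)),
            "output": repr(result),
            "duration_ms": elapsed_ms,
        })
        return result
    return wrapper


@observe
def normalize(text):
    return text.strip().lower()


normalize("  Hello ")
print(RECORDED[0]["method"], RECORDED[0]["output"])
```

Even this small wrapper adds a `repr` call per argument and per result on every invocation, which is exactly the kind of cost the premise assumes to be acceptable.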

What would settle it

Running the prototype on a project with a known silent behavioral change that the system either misses or flags incorrectly, or measuring storage and replay overhead that proves prohibitive for typical development cycles.

Original abstract

Behavioral Co-Versioning remains absent from mainstream practice: while developers routinely version source code with Git, they rarely persist and query how run-time behavior evolves across revisions. This paper argues that this mismatch contributes to a blind spot in software evolution analysis and CI, where rich execution information is discarded and typically reduced to pass/fail outcomes, despite partial test oracles, flakiness, and silent output or performance drift. We propose Behavioral Co-Versioning, a paradigm that couples the Git history with a Behavioral Archive: an append-only, queryable store of selected run-time observations (e.g., method I/O and performance signals) collected during test runs and keyed by commit and test context. This enables semantic diffing, behavior-aware regression localization, and retrospective auditing by querying historical executions, complementing proactive, signal-specific monitoring tools. We first outline a minimal data model and change diagnostics based on code/test/behavior fingerprints, and then demonstrate feasibility with a laptop-scale prototype that replays historical commits of a Python project, archives run-time observations in a local Parquet-backed store, and detects behavioral changes not apparent from textual diffs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper has a functional prototype for co-versioning behavior with Git but lacks measurements on overhead and signal quality.

The key point here is that the authors have a working laptop-scale prototype for archiving run-time observations with Git commits, using a Behavioral Archive to enable things like semantic diffing. But the evaluation stops short of measuring overhead or validating the signals.

What is new is the framing of co-versioning behavior as a first-class thing, with a minimal data model that fingerprints code, tests, and behaviors such as method I/O and performance metrics. The prototype replays commits from a Python project, collects data during test runs, and stores it in Parquet for querying historical executions. This shows feasibility for spotting non-textual changes.

The paper does well at explaining the gap in current CI practices, where execution history gets thrown away beyond pass/fail. It positions the archive as a complement to proactive monitoring, focused on retrospective analysis and regression localization.

The main limitations are the lack of any quantitative data on collection overhead, storage requirements, or false positives in change detection. The demo is small and doesn't test if the chosen signals are enough to capture meaningful drifts across revisions. Without that, the claims about practical benefits remain preliminary.

This is aimed at software engineering folks interested in evolution analysis and better CI tooling. Readers looking for extensible ideas around version control would find the prototype and data model worth examining. I think it deserves peer review. The concept is coherent and the basic implementation works, so referees could help refine the evaluation and scope.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Behavioral Co-Versioning, a paradigm that couples Git commit history with a Behavioral Archive: an append-only, queryable store of selected run-time observations (e.g., method I/O and performance signals) collected during test runs and keyed by commit and test context. This is claimed to enable semantic diffing, behavior-aware regression localization, and retrospective auditing by querying historical executions. A minimal data model and change diagnostics based on code/test/behavior fingerprints are outlined, with feasibility demonstrated by a laptop-scale prototype that replays historical commits of a Python project, archives observations in a local Parquet-backed store, and detects some behavioral changes not apparent from textual diffs.

Significance. If the practicality assumptions hold, the approach could address a recognized blind spot in software evolution and CI by making execution history first-class and queryable alongside code, complementing proactive monitoring tools. The prototype's replay-and-archive mechanism on a small Python project provides a concrete, reproducible starting point for exploring behavioral co-versioning.

major comments (2)

[Abstract and prototype demonstration] The claim that the system 'enables semantic diffing, behavior-aware regression localization, and retrospective auditing' rests on the prototype detecting non-textual changes, yet no wall-clock overhead, storage costs, false-positive rates for change detection, or scaling data beyond laptop runs are reported; this quantitative gap is load-bearing for the practicality assertion in realistic CI workflows.
[Behavioral Archive data model] Behavioral Archive and signal selection: the sufficiency of method I/O and performance signals to surface meaningful behavioral changes (as opposed to noise or flakiness) is not evaluated against any ground-truth regressions or drift cases; without such validation, the assertion that the archive complements textual diffs remains an untested assumption central to the enablement claim.
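
The noise-versus-change concern in the second comment can be made concrete with a simple guard. The sketch below is one hypothetical mitigation, not something the manuscript proposes: it assumes each commit's tests are run several times and that a behavioral change is only reported when the fingerprints are stable on both sides. The function names are illustrative.

```python
def stable_fingerprint(run_fingerprints):
    """Return the fingerprint if all repeated runs agree, else None."""
    distinct = set(run_fingerprints)
    if len(distinct) == 1:
        return next(iter(distinct))
    return None  # no runs, or runs disagree: treat as unstable


def is_credible_change(runs_before, runs_after):
    """Flag a change only when both commits behave deterministically and differ."""
    before = stable_fingerprint(runs_before)
    after = stable_fingerprint(runs_after)
    if before is None or after is None:
        return False  # at least one side is nondeterministic; do not flag
    return before != after


print(is_credible_change(["f1", "f1", "f1"], ["f2", "f2", "f2"]))  # stable on both sides
print(is_credible_change(["f1", "f1", "f1"], ["f2", "f3", "f2"]))  # flaky after the change
```

Such a guard trades missed detections for fewer false alarms, and it multiplies collection cost by the number of repetitions, so it sharpens rather than resolves the overhead question raised in the first comment.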

minor comments (2)

The data model description would benefit from an explicit table or diagram enumerating the fingerprint components (code, test context, behavior observations) and their storage format in Parquet.
[Introduction] Clarify in the introduction whether the Behavioral Archive is intended to be integrated into existing CI systems or to operate as a standalone post-processing step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The two major comments highlight important gaps in the quantitative evaluation of the prototype and the validation of the Behavioral Archive signals. We address each point below and will incorporate clarifications and additional data into the revised manuscript.

Referee: Abstract and prototype demonstration: the claim that the system 'enables semantic diffing, behavior-aware regression localization, and retrospective auditing' rests on the prototype detecting non-textual changes, yet no wall-clock overhead, storage costs, false-positive rates for change detection, or scaling data beyond laptop runs are reported; this quantitative gap is load-bearing for the practicality assertion in realistic CI workflows.

Authors: We agree that quantitative metrics on overhead, storage, false-positive rates, and scaling are necessary to substantiate practicality claims for CI workflows. The prototype was designed as a minimal, reproducible feasibility demonstration on a small Python project rather than a performance benchmark. In the revision we will add wall-clock timing and storage measurements for the replay-and-archive process on the evaluated commits, report the observed change detections with a brief analysis of potential false positives arising from test flakiness, and explicitly state the laptop-scale scope as a limitation. Comprehensive scaling experiments and CI integration studies are noted as future work.

revision: partial

Referee: Behavioral Archive and signal selection: the sufficiency of method I/O and performance signals to surface meaningful behavioral changes (as opposed to noise or flakiness) is not evaluated against any ground-truth regressions or drift cases; without such validation, the assertion that the archive complements textual diffs remains an untested assumption central to the enablement claim.

Authors: We acknowledge that the paper does not provide a ground-truth evaluation of signal sufficiency. The Behavioral Archive is presented as a minimal, extensible data model whose purpose is to make selected run-time observations queryable alongside commits; the prototype illustrates that certain behavioral differences can be detected where textual diffs are silent. In the revision we will add an explicit discussion of signal noise and flakiness risks, clarify that the current contribution centers on the co-versioning paradigm and data model rather than exhaustive signal validation, and note that systematic studies against labeled regression corpora are planned as follow-on research.

revision: partial

Circularity Check

0 steps flagged

No circularity: conceptual proposal with independent prototype demo

The paper advances a system proposal for Behavioral Co-Versioning by coupling Git history with a Behavioral Archive of run-time observations, outlining a minimal data model and demonstrating feasibility via a laptop-scale prototype that replays commits and archives observations in Parquet. No equations, fitted parameters, predictions, or self-citation chains exist that reduce any claim to its own inputs by construction; the enablement of semantic diffing and auditing is presented as a direct consequence of the described architecture and prototype execution rather than a tautological renaming or fit.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The proposal rests on the domain assumption that selected run-time signals can be collected non-intrusively during tests and that Git's commit model can be extended with an external archive without breaking existing workflows.

axioms (1)

Domain assumption: Run-time observations can be captured and keyed reliably by commit and test context without altering program semantics.
Invoked in the description of the Behavioral Archive and prototype replay mechanism.

invented entities (1)

Behavioral Archive (no independent evidence)
purpose: Append-only, queryable store of run-time observations keyed by commit and test context.
New construct introduced to couple execution history with Git; no independent evidence provided beyond the prototype sketch.

pith-pipeline@v0.9.0 · 5499 in / 1246 out tokens · 29829 ms · 2026-05-10T06:57:35.280913+00:00 · methodology

