Continuous benchmarking: Keeping pace with an evolving ecosystem of models and technologies
Pith reviewed 2026-05-10 08:06 UTC · model grok-4.3
The pith
An automated benchmarking pipeline with continuous integration features enables reproducible and reusable results for evolving HPC systems and models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that concepts of an automated benchmarking pipeline, incorporating user-agnostic operations and continuous benchmarking inspired by continuous integration, can be implemented to foster reproducibility and re-use of benchmarking results for high-performance applications. This would allow the community to keep pace with the rapid evolution of both large-scale models and high-performance computing systems, with a view towards the scientific domains of neuroscience and artificial intelligence.
What carries the argument
The automated benchmarking pipeline extended with user-agnostic operations and continuous features, designed to support customization, collaboration, and re-use.
If this is right
- Reproducibility of benchmarking results increases through automation and continuous monitoring.
- Re-use of results across community efforts supports sustainable technological progress in HPC.
- Customization options allow adaptation to specific research software needs in neuroscience and AI.
- Collaboration is facilitated by user-agnostic operations that reduce barriers for contributors.
- The pipeline helps maintain pace with rapid changes in models and computing systems.
Where Pith is reading between the lines
- Integration of this pipeline with existing continuous integration platforms could lower the barrier for smaller research teams to adopt systematic benchmarking.
- Continuous benchmarking might enable earlier detection of performance issues when new hardware or model versions are introduced.
- The emphasis on re-use could lead to shared repositories of benchmark results that reduce redundant computations across institutions.
- Adoption in other scientific domains beyond neuroscience and AI would test the generality of the user-agnostic design.
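The idea of catching performance issues early can be made concrete with a small check that compares a new set of timings against a stored baseline. The sketch below is only an illustration of that idea, not the paper's method: the function name `detect_regression`, the 5% tolerance, and the timing values are all assumptions introduced here.

```python
from statistics import median


def detect_regression(baseline, candidate, tolerance=0.05):
    """Return True if the candidate median runtime exceeds the
    baseline median by more than the relative tolerance."""
    if not baseline or not candidate:
        raise ValueError("both runtime lists must be non-empty")
    base = median(baseline)
    cand = median(candidate)
    return cand > base * (1.0 + tolerance)


# Placeholder wall-clock times in seconds (illustrative, not measured).
baseline_runs = [10.0, 10.2, 9.9]
slower_runs = [11.5, 11.4, 11.6]
similar_runs = [10.1, 10.0, 10.3]

print(detect_regression(baseline_runs, slower_runs))   # True
print(detect_regression(baseline_runs, similar_runs))  # False
```

Using the median rather than the mean is a design choice made here to reduce sensitivity to a single outlier run; a real pipeline might prefer a statistical test over a fixed threshold.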
Load-bearing premise
That the described automated benchmarking pipeline can be realized with user-agnostic operations and continuous features in a way that actually delivers customization, collaboration, and re-use without further technical specification or validation.
What would settle it
A controlled test showing that the pipeline produces no measurable gains in reproducibility or result re-use compared to standard manual benchmarking workflows on an evolving neuroscience model would falsify the claim.
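One possible way to quantify "measurable gains in reproducibility" in such a test is the run-to-run spread of repeated measurements, expressed as a coefficient of variation. This is an assumption made here for illustration; the paper does not prescribe a metric, and reproducibility covers more than runtime variance. The timing values below are made-up placeholders that only demonstrate the calculation and say nothing about what a real comparison would find.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (dimensionless)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    m = mean(samples)
    if m == 0:
        raise ValueError("mean of samples must be non-zero")
    return stdev(samples) / m


# Placeholder repeated runtimes in seconds (hypothetical, not measured).
manual_runs = [10.0, 12.0, 9.0, 11.5]
pipeline_runs = [10.0, 10.2, 9.9, 10.1]

cv_manual = coefficient_of_variation(manual_runs)
cv_pipeline = coefficient_of_variation(pipeline_runs)

# Under this metric, the claim would be in trouble if the pipeline's
# spread were not smaller than the manual workflow's.
print(cv_pipeline < cv_manual)  # True for these placeholder values
```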
Original abstract
Drawing on ideas from continuous integration, we present concepts of an automated benchmarking pipeline for high performance applications. Customization and collaboration have been key design goals owing to the requirements of research-software development as a continuous community effort. We have extended our previous conceptual work on systematic benchmarking workflows with the functionality of user-agnostic operations as well as continuous benchmarking. This fosters reproducibility and re-use of benchmarking results to ensure sustainable technological progress. We provide software-engineering solutions to keep pace with the rapid evolution of both large-scale models and high-performance computing systems with a view towards the scientific domains of neuroscience and artificial intelligence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents concepts for an automated benchmarking pipeline for high-performance applications, drawing on continuous integration ideas. Customization and collaboration are highlighted as core design goals for research software development. It extends prior conceptual work on systematic benchmarking workflows by incorporating user-agnostic operations and continuous benchmarking to promote reproducibility, re-use of results, and sustainable progress amid rapid evolution of large-scale models and HPC systems, with a focus on neuroscience and AI domains.
Significance. If realized, the concepts could help address the challenge of maintaining relevant benchmarks in rapidly changing HPC and AI ecosystems by enabling ongoing, community-oriented evaluation. The emphasis on user-agnostic features and CI analogies offers a potentially useful framework for reproducibility, though the absence of concrete mechanisms or validation means the significance remains prospective rather than demonstrated.
major comments (2)
- Abstract: The central claim that adding user-agnostic operations and continuous benchmarking to prior systematic workflows fosters reproducibility and re-use is load-bearing but unsupported, as the text provides no definitions of these operations, no data model for results, and no handling for model/system evolution that would demonstrate preservation of customization without hidden per-user dependencies.
- Abstract: No architecture, workflow examples, or feasibility analysis is given for the automated pipeline, leaving the assumption that continuous features can deliver collaboration and re-use unverified and making it impossible to evaluate whether the extension works as claimed.
minor comments (1)
- The abstract invokes CI analogies but does not explain how they map onto benchmarking specifics; spelling out this mapping would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript describing concepts for a continuous benchmarking pipeline. We appreciate the acknowledgment of the potential impact in addressing challenges in evolving HPC and AI ecosystems. We address each major comment below and have made revisions to the manuscript to clarify our conceptual contributions.
Point-by-point responses
- Referee: Abstract: The central claim that adding user-agnostic operations and continuous benchmarking to prior systematic workflows fosters reproducibility and re-use is load-bearing but unsupported, as the text provides no definitions of these operations, no data model for results, and no handling for model/system evolution that would demonstrate preservation of customization without hidden per-user dependencies.
  Authors: We agree that the abstract, being concise, does not fully elaborate on these aspects. In the body of the manuscript, user-agnostic operations are defined as benchmarking steps that operate independently of individual user environments, relying instead on standardized interfaces and shared resources. The data model for results incorporates versioning to handle model and system evolution, ensuring that customizations are preserved through modular, dependency-free configurations. We will revise the abstract to briefly include these definitions and highlight the handling of evolution, thereby supporting the claim more explicitly.
  Revision: yes
- Referee: Abstract: No architecture, workflow examples, or feasibility analysis is given for the automated pipeline, leaving the assumption that continuous features can deliver collaboration and re-use unverified and making it impossible to evaluate whether the extension works as claimed.
  Authors: As the manuscript presents a conceptual framework rather than an implemented system, we intentionally focused on high-level ideas drawn from continuous integration practices. However, we recognize that providing a high-level architecture diagram and workflow examples would aid evaluation. We will include these in the revised manuscript, along with a discussion of feasibility based on our prior systematic benchmarking workflows. Full empirical validation of the continuous features is planned for future work but is outside the scope of this conceptual paper.
  Revision: partial
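The rebuttal's description of a versioned data model and user-agnostic operations can be pictured as a result record whose identity depends only on the model version, the system, the software version, and the configuration, never on who ran it. The following is a minimal sketch under that reading; the class `BenchmarkRecord`, its field names, and the example values are hypothetical and do not come from the manuscript.

```python
from dataclasses import dataclass
import hashlib
import json


@dataclass(frozen=True)
class BenchmarkRecord:
    """One benchmark result; all field names are illustrative assumptions."""
    model_version: str
    system: str
    software_version: str
    config: tuple  # (key, value) pairs describing the run configuration
    wall_time_s: float

    def key(self) -> str:
        """Stable identifier built from everything except the measurement.

        No user name or home path enters the key, so two users running
        the same setup produce records that can be compared directly.
        """
        payload = {
            "model_version": self.model_version,
            "system": self.system,
            "software_version": self.software_version,
            "config": sorted(self.config),
        }
        blob = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()[:12]


a = BenchmarkRecord("model-v1", "system-A", "sim-1.0",
                    (("nodes", 4), ("threads", 8)), 12.3)
b = BenchmarkRecord("model-v1", "system-A", "sim-1.0",
                    (("threads", 8), ("nodes", 4)), 12.9)
c = BenchmarkRecord("model-v2", "system-A", "sim-1.0",
                    (("nodes", 4), ("threads", 8)), 15.1)

print(a.key() == b.key())  # True: same setup, order of config pairs ignored
print(a.key() == c.key())  # False: the model version changed
```

Keeping the measurement out of the key means repeated runs of the same setup group together, while a change in model or software version starts a new series rather than silently overwriting old results.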
Circularity Check
Conceptual proposal with minor self-reference to prior work; no derivation or prediction reduces to inputs
The manuscript is a high-level conceptual paper that extends the authors' previous work on systematic benchmarking workflows by adding user-agnostic operations and continuous benchmarking features, drawing analogies to continuous integration. No equations, fitted parameters, derivations, or quantitative predictions appear in the provided text or abstract. The self-reference to prior conceptual work serves only as background for the proposed extension and is not invoked to establish uniqueness, forbid alternatives, or force a result by construction. All claims about reproducibility, re-use, and sustainable progress remain design goals without reduction to self-definitional or fitted elements, rendering the proposal self-contained.