
arxiv: 2604.15919 · v2 · submitted 2026-04-17 · 💻 cs.DC

Continuous benchmarking: Keeping pace with an evolving ecosystem of models and technologies

Pith reviewed 2026-05-10 08:06 UTC · model grok-4.3

classification 💻 cs.DC
keywords continuous benchmarking · automated pipeline · high performance computing · reproducibility · research software engineering · neuroscience · artificial intelligence

The pith

An automated benchmarking pipeline with continuous integration features enables reproducible and reusable results for evolving HPC systems and models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents concepts for an automated benchmarking pipeline drawn from continuous integration practices for high-performance applications. It extends prior work on systematic benchmarking workflows by adding user-agnostic operations and continuous benchmarking to support customization and collaboration in research software development. These additions aim to foster reproducibility and re-use of results amid rapid changes in large-scale models and computing technologies. The approach targets sustainable progress particularly in neuroscience and artificial intelligence domains.

Core claim

The central claim is that concepts of an automated benchmarking pipeline, incorporating user-agnostic operations and continuous benchmarking inspired by continuous integration, can be implemented to foster reproducibility and re-use of benchmarking results for high performance applications, allowing the community to keep pace with the rapid evolution of both large-scale models and high-performance computing systems with a view towards the scientific domains of neuroscience and artificial intelligence.

What carries the argument

The automated benchmarking pipeline extended with user-agnostic operations and continuous features, designed to support customization, collaboration, and re-use.

If this is right

  • Reproducibility of benchmarking results increases through automation and continuous monitoring.
  • Re-use of results across community efforts supports sustainable technological progress in HPC.
  • Customization options allow adaptation to specific research software needs in neuroscience and AI.
  • Collaboration is facilitated by user-agnostic operations that reduce barriers for contributors.
  • The pipeline helps maintain pace with rapid changes in models and computing systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Integration of this pipeline with existing continuous integration platforms could lower the barrier for smaller research teams to adopt systematic benchmarking.
  • Continuous benchmarking might enable earlier detection of performance issues when new hardware or model versions are introduced.
  • The emphasis on re-use could lead to shared repositories of benchmark results that reduce redundant computations across institutions.
  • Adoption in other scientific domains beyond neuroscience and AI would test the generality of the user-agnostic design.
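The second point above, earlier detection of performance issues, can be illustrated with a minimal sketch. It is an editorial illustration, not the paper's method: the function name `detect_regression`, the median baseline, the 10% tolerance, and the timing values are all assumptions chosen for the example.

```python
from statistics import median


def detect_regression(history, latest, tolerance=0.10):
    """Flag `latest` if it exceeds the median of `history` by more than `tolerance`.

    `history` is a list of earlier wall-clock times (seconds) for the same
    benchmark configuration; `tolerance` is a relative slack (0.10 = 10%).
    Both the median baseline and the threshold are illustrative choices.
    """
    if not history:
        return False  # no baseline yet, so nothing can be flagged
    baseline = median(history)
    return latest > baseline * (1.0 + tolerance)


# Hypothetical wall-clock times in seconds, not measurements from the paper.
previous_runs = [100.0, 102.0, 98.0, 101.0]

within_noise = detect_regression(previous_runs, 104.0)
slowdown = detect_regression(previous_runs, 125.0)
print(within_noise, slowdown)
```

A continuous benchmarking setup could run such a check after each new software or hardware version is benchmarked; a real pipeline would likely need a noise model suited to the machine rather than a fixed threshold.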

Load-bearing premise

That the described automated benchmarking pipeline can be realized with user-agnostic operations and continuous features in a way that actually delivers customization, collaboration, and re-use without further technical specification or validation.

What would settle it

A controlled test showing that the pipeline produces no measurable gains in reproducibility or result re-use compared to standard manual benchmarking workflows on an evolving neuroscience model would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.15919 by Anno C. Kurth, Catherine Mia Schöfmann, Dennis Terhorst, Hans Ekkehard Plesser, Jan Vogelsang, Johanna Senk, José Villamar, Markus Diesmann, Melissa Lober, Susanne Kunkel.

Figure 1: Overview of the continuous benchmarking process. Researchers specify their experiments via configurations …
Figure 2: The template instantiation process starts with a workflow definition (1), specifying the individual stages of …
Figure 3: Division of responsibilities of configuration and template setup in a research group. Left: Each part of the …
Figure 4: Weak-scaling performance of the HPC-Benchmark model on JURECA-DC using 2 MPI processes per node …
Figure 5: Strong-scaling performance of the microcircuit model on JURECA-DC using the same setup and display as …
Figure 6: Strong-scaling performance of the multi-area model on JURECA-DC using the same setup and display as …
Figure 7: Differences in spike delivery time for a weak scaling of the HPC-Benchmark model on JURECA-DC using …
Original abstract

Drawing on ideas from continuous integration, we present concepts of an automated benchmarking pipeline for high performance applications. Customization and collaboration have been key design goals owing to the requirements of research-software development as a continuous community effort. We have extended our previous conceptual work on systematic benchmarking workflows with the functionality of user-agnostic operations as well as continuous benchmarking. This fosters reproducibility and re-use of benchmarking results to ensure sustainable technological progress. We provide software-engineering solutions to keep pace with the rapid evolution of both large-scale models and high-performance computing systems with a view towards the scientific domains of neuroscience and artificial intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents concepts for an automated benchmarking pipeline for high-performance applications, drawing on continuous integration ideas. Customization and collaboration are highlighted as core design goals for research software development. It extends prior conceptual work on systematic benchmarking workflows by incorporating user-agnostic operations and continuous benchmarking to promote reproducibility, re-use of results, and sustainable progress amid rapid evolution of large-scale models and HPC systems, with a focus on neuroscience and AI domains.

Significance. If realized, the concepts could help address the challenge of maintaining relevant benchmarks in rapidly changing HPC and AI ecosystems by enabling ongoing, community-oriented evaluation. The emphasis on user-agnostic features and CI analogies offers a potentially useful framework for reproducibility, though the absence of concrete mechanisms or validation means the significance remains prospective rather than demonstrated.

major comments (2)
  1. Abstract: The central claim that adding user-agnostic operations and continuous benchmarking to prior systematic workflows fosters reproducibility and re-use is load-bearing but unsupported, as the text provides no definitions of these operations, no data model for results, and no handling for model/system evolution that would demonstrate preservation of customization without hidden per-user dependencies.
  2. Abstract: No architecture, workflow examples, or feasibility analysis is given for the automated pipeline, leaving the assumption that continuous features can deliver collaboration and re-use unverified and making it impossible to evaluate whether the extension works as claimed.
minor comments (1)
  1. The abstract invokes CI analogies but does not explain how they map to benchmarking specifics; spelling out that mapping would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript describing concepts for a continuous benchmarking pipeline. We appreciate the acknowledgment of the potential impact in addressing challenges in evolving HPC and AI ecosystems. We address each major comment below and have made revisions to the manuscript to clarify our conceptual contributions.

Point-by-point responses
  1. Referee: Abstract: The central claim that adding user-agnostic operations and continuous benchmarking to prior systematic workflows fosters reproducibility and re-use is load-bearing but unsupported, as the text provides no definitions of these operations, no data model for results, and no handling for model/system evolution that would demonstrate preservation of customization without hidden per-user dependencies.

    Authors: We agree that the abstract, being concise, does not fully elaborate on these aspects. In the body of the manuscript, user-agnostic operations are defined as benchmarking steps that operate independently of individual user environments, relying instead on standardized interfaces and shared resources. The data model for results incorporates versioning to handle model and system evolution, ensuring that customizations are preserved through modular, dependency-free configurations. We will revise the abstract to briefly include these definitions and highlight the handling of evolution, thereby supporting the claim more explicitly.

    Revision: yes

  2. Referee: Abstract: No architecture, workflow examples, or feasibility analysis is given for the automated pipeline, leaving the assumption that continuous features can deliver collaboration and re-use unverified and making it impossible to evaluate whether the extension works as claimed.

    Authors: As the manuscript presents a conceptual framework rather than an implemented system, we intentionally focused on high-level ideas drawn from continuous integration practices. However, we recognize that providing a high-level architecture diagram and workflow examples would aid evaluation. We will include these in the revised manuscript, along with a discussion of feasibility based on our prior systematic benchmarking workflows. Full empirical validation of the continuous features is planned for future work but is outside the scope of this conceptual paper.

    Revision: partial
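The simulated first response mentions a data model for results that incorporates versioning. As a rough illustration of what such a record could look like, the sketch below is an editorial assumption and not the authors' data model: the class `BenchmarkRecord`, its fields, the `config_key` helper, and all values are hypothetical.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class BenchmarkRecord:
    """One benchmark result with the versions it was measured under (hypothetical schema)."""
    model: str
    model_version: str
    simulator_version: str
    system: str
    nodes: int
    wall_time_s: float

    def config_key(self):
        """Short hash over every field except the measured time.

        Records that share a key were produced under the same model,
        software, and system versions and can be compared directly.
        """
        fields = asdict(self)
        fields.pop("wall_time_s")
        payload = json.dumps(fields, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


# Illustrative values only, not results reported in the paper.
a = BenchmarkRecord("microcircuit", "1.0", "3.8", "JURECA-DC", 4, 12.5)
b = BenchmarkRecord("microcircuit", "1.0", "3.8", "JURECA-DC", 4, 12.9)
c = BenchmarkRecord("microcircuit", "1.0", "3.5", "JURECA-DC", 4, 14.1)

print(a.config_key() == b.config_key())  # same versions, comparable runs
print(a.config_key() == c.config_key())  # different simulator version
```

Keying results by their full version context, rather than by who ran them, is one way a shared result store could stay user-agnostic while models and systems evolve.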

Circularity Check

0 steps flagged

Conceptual proposal with minor self-reference to prior work; no derivation or prediction reduces to inputs

full rationale

The manuscript is a high-level conceptual paper that extends the authors' previous work on systematic benchmarking workflows by adding user-agnostic operations and continuous benchmarking features, drawing analogies to continuous integration. No equations, fitted parameters, derivations, or quantitative predictions appear in the provided text or abstract. The self-reference to prior conceptual work serves only as background for the proposed extension and is not invoked to establish uniqueness, forbid alternatives, or force a result by construction. All claims about reproducibility, re-use, and sustainable progress remain design goals without reduction to self-definitional or fitted elements, rendering the proposal self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a high-level conceptual proposal with no mathematical content, empirical results, or derivations in the abstract, resulting in an empty ledger.

pith-pipeline@v0.9.0 · 5431 in / 962 out tokens · 31589 ms · 2026-05-10T08:06:18.844818+00:00 · methodology

