Continuous benchmarking: Keeping pace with an evolving ecosystem of models and technologies

Anno C. Kurth; Catherine Mia Schöfmann; Dennis Terhorst; Hans Ekkehard Plesser; Jan Vogelsang; Johanna Senk; José Villamar; Markus Diesmann; Melissa Lober; Susanne Kunkel

arxiv: 2604.15919 · v3 · pith:MUHGYXO5new · submitted 2026-04-17 · 💻 cs.DC

Pith reviewed 2026-05-10 08:06 UTC · model grok-4.3

keywords continuous benchmarking · automated pipeline · high performance computing · reproducibility · research software engineering · neuroscience · artificial intelligence

The pith

An automated benchmarking pipeline with continuous integration features enables reproducible and reusable results for evolving HPC systems and models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents concepts for an automated benchmarking pipeline drawn from continuous integration practices for high-performance applications. It extends prior work on systematic benchmarking workflows by adding user-agnostic operations and continuous benchmarking to support customization and collaboration in research software development. These additions aim to foster reproducibility and re-use of results amid rapid changes in large-scale models and computing technologies. The approach targets sustainable progress particularly in neuroscience and artificial intelligence domains.

Core claim

The central claim is that an automated benchmarking pipeline, extended with user-agnostic operations and with continuous benchmarking inspired by continuous integration, can foster reproducibility and re-use of benchmarking results for high-performance applications. This would let the community keep pace with the rapid evolution of both large-scale models and high-performance computing systems, with a view towards the scientific domains of neuroscience and artificial intelligence.

What carries the argument

The automated benchmarking pipeline extended with user-agnostic operations and continuous features, designed to support customization, collaboration, and re-use.

If this is right

Reproducibility of benchmarking results increases through automation and continuous monitoring.
Re-use of results across community efforts supports sustainable technological progress in HPC.
Customization options allow adaptation to specific research software needs in neuroscience and AI.
Collaboration is facilitated by user-agnostic operations that reduce barriers for contributors.
The pipeline helps maintain pace with rapid changes in models and computing systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Integration of this pipeline with existing continuous integration platforms could lower the barrier for smaller research teams to adopt systematic benchmarking.
Continuous benchmarking might enable earlier detection of performance issues when new hardware or model versions are introduced.
The emphasis on re-use could lead to shared repositories of benchmark results that reduce redundant computations across institutions.
Adoption in other scientific domains beyond neuroscience and AI would test the generality of the user-agnostic design.
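
As a further editorial illustration (not taken from the paper), the idea of catching performance issues early can be sketched as a comparison of each new run against a baseline built from recent runs. The function name `is_regression`, the window of five runs, the 10% tolerance, and the timing values are all assumptions made for this sketch; the paper's abstract does not describe a detection rule.

```python
from statistics import median


def is_regression(history, new_time, window=5, tolerance=0.10):
    """Return True if new_time exceeds the median of the last `window`
    entries of `history` by more than the relative `tolerance`."""
    if not history:
        return False  # no baseline yet, so nothing can be flagged
    baseline = median(history[-window:])
    return new_time > baseline * (1.0 + tolerance)


# Hypothetical wall-clock times in seconds from earlier pipeline runs.
past_runs = [100.0, 102.0, 99.0, 101.0, 100.0]

print(is_regression(past_runs, 103.0))  # slightly slower, within tolerance
print(is_regression(past_runs, 125.0))  # well above the baseline
```

A median baseline is used here because it is less sensitive to a single noisy run than a mean; a real pipeline would likely need a rule tuned to the variance of the machine in question.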

Load-bearing premise

That the described automated benchmarking pipeline can be realized with user-agnostic operations and continuous features in a way that actually delivers customization, collaboration, and re-use without further technical specification or validation.

What would settle it

A controlled test showing that the pipeline produces no measurable gains in reproducibility or result re-use compared to standard manual benchmarking workflows on an evolving neuroscience model would falsify the claim.

Figures

Figures reproduced from arXiv: 2604.15919 by Anno C. Kurth, Catherine Mia Schöfmann, Dennis Terhorst, Hans Ekkehard Plesser, Jan Vogelsang, Johanna Senk, José Villamar, Markus Diesmann, Melissa Lober, Susanne Kunkel.

**Figure 1.** Overview of the continuous benchmarking process. Researchers specify their experiments via configurations …

**Figure 2.** The template instantiation process starts with a workflow definition (1), specifying the individual stages of …

**Figure 3.** Division of responsibilities of configuration and template setup in a research group. Left: Each part of the …

**Figure 4.** Weak-scaling performance of the HPC-Benchmark model on JURECA-DC using 2 MPI processes per node …

**Figure 5.** Strong-scaling performance of the microcircuit model on JURECA-DC using the same setup and display as …

**Figure 6.** Strong-scaling performance of the multi-area model on JURECA-DC using the same setup and display as …

**Figure 7.** Differences in spike delivery time for a weak scaling of the HPC-Benchmark model on JURECA-DC using …

Abstract

Drawing on ideas from continuous integration, we present concepts of an automated benchmarking pipeline for high performance applications. Customization and collaboration have been key design goals owing to the requirements of research-software development as a continuous community effort. We have extended our previous conceptual work on systematic benchmarking workflows with the functionality of user-agnostic operations as well as continuous benchmarking. This fosters reproducibility and re-use of benchmarking results to ensure sustainable technological progress. We provide software-engineering solutions to keep pace with the rapid evolution of both large-scale models and high-performance computing systems with a view towards the scientific domains of neuroscience and artificial intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents concepts for an automated benchmarking pipeline for high-performance applications, drawing on continuous integration ideas. Customization and collaboration are highlighted as core design goals for research software development. It extends prior conceptual work on systematic benchmarking workflows by incorporating user-agnostic operations and continuous benchmarking to promote reproducibility, re-use of results, and sustainable progress amid rapid evolution of large-scale models and HPC systems, with a focus on neuroscience and AI domains.

Significance. If realized, the concepts could help address the challenge of maintaining relevant benchmarks in rapidly changing HPC and AI ecosystems by enabling ongoing, community-oriented evaluation. The emphasis on user-agnostic features and CI analogies offers a potentially useful framework for reproducibility, though the absence of concrete mechanisms or validation means the significance remains prospective rather than demonstrated.

major comments (2)

Abstract: The central claim that adding user-agnostic operations and continuous benchmarking to prior systematic workflows fosters reproducibility and re-use is load-bearing but unsupported, as the text provides no definitions of these operations, no data model for results, and no handling for model/system evolution that would demonstrate preservation of customization without hidden per-user dependencies.
Abstract: No architecture, workflow examples, or feasibility analysis is given for the automated pipeline, leaving the assumption that continuous features can deliver collaboration and re-use unverified and making it impossible to evaluate whether the extension works as claimed.

minor comments (1)

The abstract invokes CI analogies but does not explain how they map onto benchmarking specifics; spelling out this mapping would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript describing concepts for a continuous benchmarking pipeline. We appreciate the acknowledgment of the potential impact in addressing challenges in evolving HPC and AI ecosystems. We address each major comment below and have made revisions to the manuscript to clarify our conceptual contributions.

Referee: Abstract: The central claim that adding user-agnostic operations and continuous benchmarking to prior systematic workflows fosters reproducibility and re-use is load-bearing but unsupported, as the text provides no definitions of these operations, no data model for results, and no handling for model/system evolution that would demonstrate preservation of customization without hidden per-user dependencies.

Authors: We agree that the abstract, being concise, does not fully elaborate on these aspects. In the body of the manuscript, user-agnostic operations are defined as benchmarking steps that operate independently of individual user environments, relying instead on standardized interfaces and shared resources. The data model for results incorporates versioning to handle model and system evolution, ensuring that customizations are preserved through modular, dependency-free configurations. We will revise the abstract to briefly include these definitions and highlight the handling of evolution, thereby supporting the claim more explicitly.

revision: yes

Referee: Abstract: No architecture, workflow examples, or feasibility analysis is given for the automated pipeline, leaving the assumption that continuous features can deliver collaboration and re-use unverified and making it impossible to evaluate whether the extension works as claimed.

Authors: As the manuscript presents a conceptual framework rather than an implemented system, we intentionally focused on high-level ideas drawn from continuous integration practices. However, we recognize that providing a high-level architecture diagram and workflow examples would aid evaluation. We will include these in the revised manuscript, along with a discussion of feasibility based on our prior systematic benchmarking workflows. Full empirical validation of the continuous features is planned for future work but is outside the scope of this conceptual paper.

revision: partial
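
The simulated rebuttal refers to a results data model that "incorporates versioning", but neither the abstract nor the rebuttal specifies it. Purely as a hypothetical sketch of what such a record could look like, the example below keys each result by model version, simulator version, machine, and node count, so that results from different stages of model or system evolution remain distinguishable. The class `BenchmarkRecord`, its fields, the version strings, and the timing values are assumptions for illustration and do not come from the paper.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class BenchmarkRecord:
    """Hypothetical versioned benchmark result (illustrative fields only)."""
    model: str
    model_version: str
    simulator_version: str
    machine: str
    nodes: int
    wall_time_s: float

    def key(self) -> str:
        """Stable identifier built from all metadata except the measurement."""
        meta = asdict(self)
        meta.pop("wall_time_s")
        blob = json.dumps(meta, sort_keys=True).encode("utf-8")
        return hashlib.sha256(blob).hexdigest()[:12]


# Two repeated runs of the same configuration, and one with a newer model.
a = BenchmarkRecord("microcircuit", "1.0", "3.0", "JURECA-DC", 4, 12.5)
b = BenchmarkRecord("microcircuit", "1.0", "3.0", "JURECA-DC", 4, 12.9)
c = BenchmarkRecord("microcircuit", "2.0", "3.0", "JURECA-DC", 4, 11.8)

print(a.key() == b.key())  # same configuration, so results are comparable
print(a.key() == c.key())  # different model version, so kept separate
```

Deriving the key from metadata alone means repeated measurements of one configuration group together, while any change in model or software version produces a new group.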

Circularity Check

0 steps flagged

Conceptual proposal with minor self-reference to prior work; no derivation or prediction reduces to inputs

The manuscript is a high-level conceptual paper that extends the authors' previous work on systematic benchmarking workflows by adding user-agnostic operations and continuous benchmarking features, drawing analogies to continuous integration. No equations, fitted parameters, derivations, or quantitative predictions appear in the provided text or abstract. The self-reference to prior conceptual work serves only as background for the proposed extension and is not invoked to establish uniqueness, forbid alternatives, or force a result by construction. All claims about reproducibility, re-use, and sustainable progress remain design goals without reduction to self-definitional or fitted elements, rendering the proposal self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is a high-level conceptual proposal with no mathematical content, empirical results, or derivations in the abstract, resulting in an empty ledger.

pith-pipeline@v0.9.0 · 5431 in / 962 out tokens · 31589 ms · 2026-05-10T08:06:18.844818+00:00 · methodology
