
arxiv: 2606.07667 · v1 · pith:NKLS2CISnew · submitted 2026-06-04 · ⚛️ physics.data-an · nucl-ex · physics.ins-det

Design Principles for AI-Ready QCD Data with a Barrel Imaging Calorimeter Application

Pith reviewed 2026-06-27 23:01 UTC · model grok-4.3

classification ⚛️ physics.data-an · nucl-ex · physics.ins-det
keywords AI-ready data · QCD data · unified schema · Barrel Imaging Calorimeter · heterogeneous detectors · data preparation pipeline · ePIC detector

The pith

A unified schema organizes heterogeneous QCD detector data from different technologies into one structure for AI applications.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a design framework that defines a single data structure to handle the varied readouts produced by different detector technologies in QCD collider experiments. It applies this framework to simulated data from the Barrel Imaging Calorimeter in the ePIC detector, which combines silicon pixel imaging layers with calorimeter layers. The resulting schema, preparation pipeline, and visualization make the data suitable for cross-experiment AI use without technology-specific adjustments. A sympathetic reader would see this as enabling models trained on one detector's output to transfer more readily to another.

Core claim

The paper claims that a design framework for AI-ready QCD data produces a unified data structure capable of accommodating heterogeneous detector technologies within a single schema, demonstrated through its application to the simulated Barrel Imaging Calorimeter data that integrates AstroPix silicon pixel imaging layers with Pb/ScFi calorimeter layers across different readout types.

What carries the argument

The unified data schema that standardizes heterogeneous detector readouts from different technologies into a common format for AI applications.
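
As a rough illustration of what "one structure for several technologies" can mean in practice, the sketch below stores hits from both BIC subsystems in a single NumPy structured array and separates them by a subsystem index. Only the event field (uint64) and the subsystem field (uint16, 0 = AstroPix, 1 = Pb/ScFi) are taken from the metadata excerpts in the reference graph on this page; the value and time columns, the helper name split_by_subsystem, and all sample numbers are hypothetical placeholders, not the paper's actual fields.

```python
import numpy as np

# `event` and `subsystem` follow the metadata excerpts quoted on this page.
# `value` and `time` are HYPOTHETICAL stand-ins for detector-specific columns.
MEASUREMENTS_DTYPE = np.dtype([
    ("event", np.uint64),      # grouping key, not a learnable feature
    ("subsystem", np.uint16),  # 0 = AstroPix, 1 = Pb/ScFi
    ("value", np.float32),     # hypothetical signal amplitude
    ("time", np.float32),      # hypothetical hit time
])

ASTROPIX, PB_SCFI = 0, 1


def split_by_subsystem(rows):
    """Return one array per subsystem so a model can treat them separately."""
    return {
        "astropix": rows[rows["subsystem"] == ASTROPIX],
        "pb_scfi": rows[rows["subsystem"] == PB_SCFI],
    }


# Made-up rows for illustration only: (event, subsystem, value, time).
rows = np.array(
    [(0, 0, 1.5, 0.1), (0, 1, 20.0, 0.3), (1, 1, 7.25, 0.2)],
    dtype=MEASUREMENTS_DTYPE,
)
parts = split_by_subsystem(rows)
```

Keeping both technologies in one table and splitting on a categorical index is only one possible layout; it is shown here because the excerpts describe the subsystem field as the place where the two readout types are told apart.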

If this is right

  • AI models can be trained on data from one QCD experiment and applied to others without per-technology retraining.
  • Data preparation and visualization pipelines become reusable across different calorimeter and imaging layer combinations.
  • The schema handles both pixel imaging and calorimeter readouts in the BIC simulation within one structure.
  • Cross-experiment comparisons of AI performance become possible on standardized inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Detector design teams could adopt the schema early to simplify future AI analysis.
  • The approach might apply to non-collider experiments that also face mixed readout technologies.
  • Standardization could lower barriers for external groups to contribute to QCD data analysis.

Load-bearing premise

That the heterogeneity of detector readouts and their technology dependence can be addressed through principled curation into a single schema that remains useful for cross-experiment AI applications.

What would settle it

Demonstrating that data from a second detector technology cannot be mapped into the same schema without losing essential information or requiring major custom changes would falsify the unification claim.

Figures

Figures reproduced from arXiv: 2606.07667 by Chun Yuen Tsang, Maria Zurek, Minho Kim, Sylvester Joosten, Zhiwan Xu.

Figure 1. Global transverse ( [caption truncated]
Figure 2. Distribution of the number of MC particle contributions per hit for the two BIC subsystems.
Original abstract

Data from large physics collider experiments in Quantum Chromodynamics (QCD) research differ fundamentally from the modalities used in modern foundation models. The heterogeneity of detector readouts and their technology dependence require principled curation for cross-experiment AI applications. We present a design framework for AI-ready QCD data to define a unified data structure that accommodates heterogeneous detector technologies within a single schema. We apply the design principle to the simulated data of the Barrel Imaging Calorimeter (BIC) in the ePIC detector at the Electron–Ion Collider. The BIC simulation data combines AstroPix silicon pixel imaging layers with Pb/ScFi calorimeter layers across different readout types. We describe the schema specialization, data preparation pipeline, and visualization of the curated AI-ready dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a design framework for creating AI-ready QCD data from collider experiments, defining a unified data structure and schema intended to accommodate heterogeneous detector technologies across experiments. It applies the framework to simulated Barrel Imaging Calorimeter (BIC) data from the ePIC detector at the Electron-Ion Collider, combining AstroPix silicon pixel imaging layers with Pb/ScFi calorimeter layers, and describes the resulting schema specialization, data preparation pipeline, and visualizations of the curated dataset.

Significance. If the proposed unified schema can be shown to generalize without per-experiment re-engineering, the work would provide a valuable contribution to standardizing heterogeneous high-energy physics data for modern AI methods, potentially enabling more scalable cross-experiment analyses and foundation-model applications in QCD research. The concrete BIC example illustrates practical curation steps for mixed readout types.

major comments (2)
  1. [Abstract and application to BIC] Abstract and application section: the central claim that the framework defines a unified schema 'that accommodates heterogeneous detector technologies within a single schema' and is 'useful for cross-experiment AI applications' rests on a single simulated BIC/ePIC case (AstroPix pixels + Pb/ScFi layers). No second detector technology, no real (non-simulated) data, and no explicit portability test are described, leaving the load-bearing generality assumption untested.
  2. [Abstract] Abstract: the manuscript states the framework and its BIC application but reports no validation metrics, performance benchmarks, error analysis, or comparison against alternative data schemas or curation approaches. This absence weakens support for the claim that the schema is AI-ready and cross-experiment viable.
minor comments (1)
  1. The manuscript would benefit from explicit section numbering and clearer delineation between the general design principles and the BIC-specific specialization to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, clarifying the scope of the design-focused manuscript while acknowledging its limitations.

  1. Referee: [Abstract and application to BIC] Abstract and application section: the central claim that the framework defines a unified schema 'that accommodates heterogeneous detector technologies within a single schema' and is 'useful for cross-experiment AI applications' rests on a single simulated BIC/ePIC case (AstroPix pixels + Pb/ScFi layers). No second detector technology, no real (non-simulated) data, and no explicit portability test are described, leaving the load-bearing generality assumption untested.

    Authors: The manuscript presents a set of design principles for a unified schema that can accommodate heterogeneous readout types (as demonstrated by combining AstroPix pixels and Pb/ScFi calorimeter layers in the BIC). The BIC application illustrates the specialization process for mixed technologies within one schema. We agree that the work relies on a single simulated example and does not include a second detector or real data. We will revise the abstract and discussion sections to explicitly frame the contribution as a proposed design framework with an initial application, noting that cross-experiment portability remains to be tested in future work. revision: partial

  2. Referee: [Abstract] Abstract: the manuscript states the framework and its BIC application but reports no validation metrics, performance benchmarks, error analysis, or comparison against alternative data schemas or curation approaches. This absence weakens support for the claim that the schema is AI-ready and cross-experiment viable.

    Authors: This is a design paper focused on defining principles and the resulting schema for AI-ready data curation, not on downstream AI model performance. No quantitative benchmarks or comparisons are included because the contribution centers on the data structure and preparation pipeline itself. The visualizations and pipeline description are intended to support the AI-readiness claim at the conceptual level. Adding empirical validation would require model training experiments outside the current scope. revision: no

Circularity Check

0 steps flagged

No circularity; design framework is definitional with no derivations or fitted predictions.


The paper presents a design framework for unifying heterogeneous detector data into a single schema and demonstrates its application to simulated BIC/ePIC data. No equations, parameter fits, predictions, or derivation chains appear in the provided text. The central claim is the definition and curation process itself rather than any result derived from prior outputs or self-citations. The single-detector application is presented as an example, not as a statistically forced prediction, so no reduction to inputs by construction occurs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review; ledger populated from stated motivations in the abstract with no further details available on parameters or supporting evidence.

axioms (1)
  • domain assumption: Data from large physics collider experiments in QCD research differ fundamentally from the modalities used in modern foundation models due to heterogeneity of detector readouts and technology dependence.
    Explicitly stated in the abstract as the core problem motivating the framework.
invented entities (1)
  • Unified data structure for AI-ready QCD data (no independent evidence)
    purpose: To accommodate heterogeneous detector technologies within a single schema for cross-experiment AI applications.
    Central proposed artifact described in the abstract.

pith-pipeline@v0.9.1-grok · 5666 in / 1295 out tokens · 30747 ms · 2026-06-27T23:01:47.598106+00:00 · methodology


Reference graph

Works this paper leans on

23 extracted references · 2 linked inside Pith

  1. [1]

    AI-ready data array

  2. [2]

    Data Preparation and Format

    Truth Labeling; B. Data Preparation and Format; C. Visualization; V. Summary and Outlook; Acknowledgments; A. ePIC BIC Dataset Metadata

  3. [3]

    Metadata for the measurements Array

  4. [4]

    Metadata for the labels Array

  5. [5]

    INTRODUCTION Nuclear and particle physics detectors record particle interactions as heterogeneous electronic signals

    Metadata for the event label Array; References. I. INTRODUCTION. Nuclear and particle physics detectors record particle interactions as heterogeneous electronic signals. A single collision event produces thousands of readouts across multiple subdetectors, each with distinct geometry, segmentation, readout technology, dynamic range, and noise charac...

  6. [6]

    First, the detector field is implemented as subsystem, with index 0 for AstroPix and 1 for Pb/ScFi

    AI-ready data array. Applying the measurements schema to the BIC requires four adaptations. First, the detector field is implemented as subsystem, with index 0 for AstroPix and 1 for Pb/ScFi. Because these two subsystems differ in spatial resolution, timing semantics, and signal physics, AI models should read separate representations for each, for example by t...

  7. [7]

    Truth Labeling. Applying the labels array to the BIC simulation data requires a few adaptations. For calorimeter truth labeling, each shower secondary produced within the BIC is traced back to the contributing particle, defined as the stable generator-level particle that entered the BIC volume and initiated the corresponding shower. It represents the truth l...

  8. [8]

    not yet assigned

    Metadata for the measurements Array. • event: uint64 integer index, dimensionless. A unique identifier for each physics event, corresponding to one simulated electron–proton collision. All rows from both AstroPix and Pb/ScFi subsystems belonging to the same collision share this index. This is a grouping key for batching, not a learnable feature. Rows are sorted...

  9. [9]

    Matches the event index in measurements

    Metadata for the labels Array. • event: uint64 integer index, dimensionless. Matches the event index in measurements. This is a join key: rows with the same event in measurements and labels belong to the same collision. Not a learnable feature. • subsystem: uint16 categorical index, dimensionless. Matches the subsystem field in measurements: 0 = AstroPix, 1 = Pb/ScFi. This is...

  10. [10]

    Same event index as in measurements and labels, identifying the collision to which this particle belongs

    Metadata for the event label Array. • event: uint64 integer index, dimensionless. Same event index as in measurements and labels, identifying the collision to which this particle belongs. This is a join key: use it together with particle to look up a specific MC particle from a specific event. • particle: uint16 integer index, dimensionless. Sequential index of the pa...

  11. [11]

    F. Gaede et al., “EDM4hep: A common event data model for HEP experiments,” https://github.com/key4hep/EDM4hep (2024)

  12. [12]

    Y. Chen et al., Sci. Data 9, 31 (2022), arXiv:2108.02214 [hep-ph]

  13. [13]

    R. Kansal, J. Duarte, H. Su, B. Orzari, T. Tomei, M. Pierini, M. Touranakou, J.-R. Vlimant, and D. Gunopulos, Adv. Neural Inf. Process. Syst. 34, 23858 (2021), arXiv:2106.11535 [cs.LG]

  14. [14]

    S. Amrouche et al., The Springer Series on Challenges in Machine Learning, 231 (2020), arXiv:1904.06778 [hep-ex]

  15. [15]

    A. Buckley, P. Ilten, D. Konstantinov, L. Lönnblad, J. Monk, W. Pokorski, T. Przedzinski, M. Posocco, P. Ruiz-Femenia, and Q. Zheng, Comput. Phys. Commun. 260, 107310 (2021), arXiv:1912.08005 [hep-ph]

  16. [16]

    R. L. Workman et al. (Particle Data Group), PTEP 2022, 083C01 (2022)

  17. [17]

    R. Abdul Khalek et al., Nucl. Phys. A 1026, 122447 (2022), arXiv:2103.05419 [physics.ins-det]

  18. [18]

    B. Kim et al., in Proceedings of the 34th International Workshop on Vertex Detectors (VERTEX 2025), Vol. VERTEX2025 (2025) p. 031, arXiv:2511.05639 [physics.ins-det]

  19. [19]

    H. Klest et al., JINST 20, P07028 (2025), arXiv:2504.03079 [physics.ins-det]

  20. [20]

    T. Sjöstrand, S. Ask, J. R. Christiansen, R. Corke, N. Desai, P. Ilten, S. Mrenna, S. Prestel, C. O. Rasmussen, and P. Z. Skands, Comput. Phys. Commun. 191, 159 (2015), arXiv:1410.3012 [hep-ph]

  21. [21]

    S. Agostinelli et al. (GEANT4), Nucl. Instrum. Meth. A 506, 250 (2003)

  22. [22]

    M. Frank, F. Gaede, C. Grefe, and P. Mato, J. Phys. Conf. Ser. 513, 022010 (2014).

  23. [23]

    ePIC Collaboration, “EICrecon: EIC Reconstruction Software,” https://github.com/eic/EICrecon (2024). [14] https://pdg.lbl.gov/2007/reviews/montecarlorpp.pdf. [15] https://pdg.lbl.gov/2007/reviews/montecarlorpp.pdf.
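
The metadata excerpts in entries 8 to 10 above describe event as a grouping key in the measurements array and as a join key shared with the labels array. A minimal sketch of how such keys might be used is given below, assuming NumPy structured arrays. The column names event and subsystem and their integer types come from those excerpts; the tiny tables, the helper names rows_for_event and group_by_event, and the decision not to rely on any row ordering are assumptions made for illustration.

```python
import numpy as np

# HYPOTHETICAL miniature tables; only the `event` and `subsystem` column
# names and their integer types are taken from the excerpts above.
KEY_DTYPE = [("event", np.uint64), ("subsystem", np.uint16)]
measurements = np.array([(7, 0), (7, 1), (9, 1)], dtype=KEY_DTYPE)
labels = np.array([(7, 0), (7, 1), (9, 1), (9, 1)], dtype=KEY_DTYPE)


def rows_for_event(table, event_id):
    """Select every row of `table` that belongs to one collision."""
    return table[table["event"] == np.uint64(event_id)]


def group_by_event(table):
    """Map each event index to its rows, without assuming any sort order."""
    return {int(e): rows_for_event(table, e) for e in np.unique(table["event"])}


meas_groups = group_by_event(measurements)
label_groups = group_by_event(labels)

# Events present in both tables can be paired up, e.g. for supervised training.
shared_events = sorted(meas_groups.keys() & label_groups.keys())
```

Because event is described as a key rather than a feature, a loader along these lines would use it only to assemble per-collision batches and would not pass it to a model as an input column.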