pith. sign in

arxiv: 2606.13535 · v3 · pith:HEFH5CPCnew · submitted 2026-06-11 · ✦ hep-ex · cs.AI· hep-ph

AgentRivet: an automated system for producing Rivet routines from journal publications

Pith reviewed 2026-06-27 04:48 UTC · model grok-4.3

classification ✦ hep-ex cs.AIhep-ph
keywords Rivet routinesanalysis preservationLarge Language Modelscollider measurementsautomated code generationfiducial observablesMonte Carlo comparison
0
0 comments X

The pith

Large language models can extract analysis details from papers and generate working Rivet routines with reasonable fidelity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an automated workflow that reads published collider measurements and produces the code libraries needed to compare new theoretical predictions against those measurements. Only 39 percent of such measurements currently have public Rivet routines, leaving many results inaccessible for model testing. The workflow proceeds through extraction of observables and cuts, code writing, and separate reviews of syntax and physics content. When applied to two recent measurements, the generated routines showed few syntax errors and generally reproduced the physics described in the source papers, although implementation problems still occurred in some cases.

Core claim

AgentRivet is a multi-step workflow based on Large Language Models that extracts the physics analysis information from published papers and writes the missing Rivet routines, with intermediate code- and physics-reviews as part of an autonomous quality control. Tests on two measurements show that the system produces competent Rivet routines with few syntax errors. The physics fidelity of the routines is reasonable and follows the explanations given in the relevant publications, although physics-implementation issues arise mainly from subtle ambiguities in the source papers.

What carries the argument

The multi-step LLM workflow that extracts observables, cuts and fiducial regions from papers, writes Rivet code, and performs autonomous code and physics reviews before output.

If this is right

  • Rivet coverage can rise above the current 39 percent of measurements.
  • New Monte Carlo models can be tested against a larger set of preserved analyses without manual coding.
  • Searches for physics beyond the Standard Model gain access to additional fiducial measurements.
  • Analysis preservation becomes less dependent on individual authors writing routines after publication.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may allow routine generation to keep pace with the growing volume of published measurements.
  • Ambiguous wording in papers will continue to require targeted human checks even after automation.
  • The workflow could be extended to flag papers whose definitions are too vague for reliable implementation.
  • Integration into journal submission processes might encourage clearer reporting of analysis details.

Load-bearing premise

Published papers contain sufficiently unambiguous definitions of observables, cuts and fiducial regions that LLMs can extract and implement them correctly without human clarification.

What would settle it

Running the generated Rivet routines on the original datasets and comparing the resulting distributions directly to the published measurement results would show whether the physics implementation is correct.

Figures

Figures reproduced from arXiv: 2606.13535 by Andrew D. Pilkington, Antonio J. Costa, Caterina Doglioni, Christian G\"utschow, Sukanya Sinha.

Figure 1
Figure 1. Figure 1: Summary of the AGENTRIVET workflow. The black arrows represent the transfer of information between different steps in the pipeline. The dotted lines rep￾resent the storage to Memory. complementary perspectives and produce structured review objects describing issues and sug￾gested corrections. The resulting review feedback is subsequently incorporated into the next iteration of the CODER prompt, allowing th… view at source ↗
Figure 2
Figure 2. Figure 2: Normalised differential cross-sections as a function of thrust (a,c,e) and [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Differential cross sections as a function of [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
read the original abstract

Particle physics collider experiments provide Rivet routines as part of the analysis preservation strategy for model-independent measurements. Rivet is a C++ toolkit that allow new theoretical models to be compared to the measurements, thus aiding the development and tuning of Monte Carlo event generators as well as searches for physics beyond the Standard Model. However, analysis coverage is known to be incomplete, with only 39% of measurements having documented and publicly available Rivet routines. In this article, we design and implement an automated workflow based on Large Language Models with the goal of providing the missing routines. This multi-step workflow, referred to as AgentRivet, extracts the physics analysis information from published papers and writes the missing Rivet routines, with intermediate code- and physics- reviews as part of an autonomous quality control. We report the results obtained using commercial Large Language Models, provided by OpenAI, Anthropic, and Google, for two recent measurements from the ATLAS and CMS experiments. We find that AgentRivet produces competent Rivet routines with few syntax errors. The physics fidelity of the routines is reasonable and follows the explanations given in the relevant publications. Nevertheless, physics-implementation issues do arise and are investigated using the artefacts produced by AgentRivet. The majority of physics implementation issues arise from subtle-but-ambiguous definitions in the given publication, although some models struggle to implement complex observables even when clear definitions are given.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces AgentRivet, an automated multi-step workflow using commercial LLMs (from OpenAI, Anthropic, and Google) to extract physics analysis details from journal publications and generate Rivet routines, incorporating intermediate code- and physics-review steps for quality control. It reports results from applying the system to two recent ATLAS and CMS measurements, claiming that the output routines are competent with few syntax errors and exhibit reasonable physics fidelity that follows the source publications, while noting that some implementation issues arise primarily from ambiguous definitions in the papers.

Significance. If the approach can be shown to scale reliably, it would address the known gap in Rivet coverage (currently only 39% of measurements) and thereby improve model-independent comparisons between theory and experiment, Monte Carlo tuning, and BSM searches. The multi-provider LLM strategy and autonomous review steps represent a constructive engineering effort toward automation; however, the present demonstration supplies no quantitative performance metrics and rests on a sample of two analyses, so the immediate significance remains limited.

major comments (2)
  1. [Abstract] Abstract and reported results: the central claim that AgentRivet 'produces competent Rivet routines' rests on qualitative judgments ('few syntax errors', 'reasonable' fidelity) for exactly two test cases, with no tabulated error counts, success rates, inter-rater agreement, or side-by-side comparison against human-written Rivet code. This small-N, non-metric evaluation is load-bearing for the claim and leaves its generality unestablished.
  2. [Workflow and results] Workflow description and results: the physics-review stage is performed by an LLM of the same class as the generation stage, yet the manuscript supplies neither frequency/severity statistics for the observed physics issues nor an external validation protocol. Without these, the assertion that issues 'arise from subtle-but-ambiguous definitions' cannot be quantified even within the two reported cases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. Our work presents AgentRivet as a proof-of-concept demonstration on two analyses, and we agree the evaluation can be strengthened with additional quantitative details from the existing cases. We respond point-by-point below, indicating planned revisions.

read point-by-point responses
  1. Referee: [Abstract] Abstract and reported results: the central claim that AgentRivet 'produces competent Rivet routines' rests on qualitative judgments ('few syntax errors', 'reasonable' fidelity) for exactly two test cases, with no tabulated error counts, success rates, inter-rater agreement, or side-by-side comparison against human-written Rivet code. This small-N, non-metric evaluation is load-bearing for the claim and leaves its generality unestablished.

    Authors: We agree the evaluation is qualitative and limited to two cases, consistent with the manuscript's framing as an initial demonstration of workflow feasibility rather than a broad benchmark. We will revise the abstract and results to include tabulated counts of syntax errors (e.g., compilation issues) and physics discrepancies per case, with explicit examples. Inter-rater agreement metrics are not applicable to this author-reviewed assessment. Side-by-side human comparisons are not included, as they require separate expert effort, but all generated routines and artefacts will be released to support such evaluations. The revisions will emphasize the proof-of-concept scope while making the reported observations more precise. revision: yes

  2. Referee: [Workflow and results] Workflow description and results: the physics-review stage is performed by an LLM of the same class as the generation stage, yet the manuscript supplies neither frequency/severity statistics for the observed physics issues nor an external validation protocol. Without these, the assertion that issues 'arise from subtle-but-ambiguous definitions' cannot be quantified even within the two reported cases.

    Authors: The physics-review LLM provides an initial automated check, but issue classification (ambiguous definitions versus model limitations) derives from the authors' subsequent manual inspection of outputs against the source papers. We will revise the workflow and results sections to add frequency and severity statistics for physics issues in both test cases, with categorization and examples. This will quantify the assertion within the reported scope. A formal external validation protocol is not implemented here and lies beyond the current demonstration; we will explicitly note this as a limitation and potential future extension. revision: yes

Circularity Check

0 steps flagged

No circularity: workflow and evaluation rest on external publications and standard Rivet framework

full rationale

The paper presents an LLM-driven workflow (AgentRivet) that extracts analysis details from published papers and generates Rivet code, with intermediate reviews. Its central claim is an empirical demonstration on two independent ATLAS/CMS measurements, asserting competent output and reasonable fidelity by direct comparison to the source publications' explanations. No equations, fitted parameters, or derivations are involved. No self-citations are invoked as load-bearing premises, and the evaluation does not reduce to any input by construction. The process is self-contained against external benchmarks (the original journal papers and the Rivet toolkit), satisfying the default expectation of no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The system depends on existing commercial LLM capabilities and the established Rivet framework; no new free parameters, axioms or invented entities are introduced.

pith-pipeline@v0.9.1-grok · 6086 in / 952 out tokens · 60338 ms · 2026-06-27T04:48:17.668994+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 9 canonical work pages · 3 internal anchors

  1. [1]

    HEPData: a repository for high energy physics data

    Maguire, Eamonn and Heinrich, Lukas and Watt, Graeme. HEPData: a repository for high energy physics data. J. Phys. Conf. Ser. 2017. doi:10.1088/1742-6596/898/10/102006. arXiv:1704.05473

  2. [2]

    2024 , publisher=

    Christian Bierlich and others , journal=. 2024 , publisher=. doi:10.21468/SciPostPhysCodeb.36 , url=

  3. [3]

    Buckley and others , journal=

    A. Buckley and others , journal=. 2021 , publisher=. doi:10.21468/SciPostPhysCore.4.2.013 , url=

  4. [4]

    Measurement and interpretation of inclusive W production in proton-proton collisions at s =13 TeV using the ATLAS detector

    ATLAS Collaboration. Measurement and interpretation of inclusive W production in proton-proton collisions at s =13 TeV using the ATLAS detector. 2026. arXiv:2603.22478

  5. [5]

    Precise measurement of the t t production cross-section and lepton differential distributions in e dilepton events from s =13 TeV pp collisions with the ATLAS detector

    ATLAS Collaboration. Precise measurement of the t t production cross-section and lepton differential distributions in e dilepton events from s =13 TeV pp collisions with the ATLAS detector. Eur. Phys. J. C. 2026. doi:10.1140/epjc/s10052-026-15311-0. arXiv:2509.15066

  6. [6]

    Measurement of event shape variables using charged particles inside jets in proton-proton collisions at s = 13 TeV

    CMS Collaboration. Measurement of event shape variables using charged particles inside jets in proton-proton collisions at s = 13 TeV. 2026. arXiv:2602.17509

  7. [7]

    The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations

    Alwall, J. and others. The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations. JHEP. 2014. doi:10.1007/JHEP07(2014)079. arXiv:1405.0301

  8. [8]

    Recommendations for Best Practices for Data Preservation and Open Science in HEP

    Campana, Simone and others. Recommendations for Best Practices for Data Preservation and Open Science in HEP. 2025. arXiv:2508.18892

  9. [9]

    DPHEP-2025-01 , title = "

    Arbey, Alexandre and others. DPHEP-2025-01 , title = ". 2025. arXiv:2503.23619

  10. [10]

    2020 , publisher=

    Waleed Abdallah and others , journal=. 2020 , publisher=. doi:10.21468/SciPostPhys.9.2.022 , url=

  11. [11]

    A comprehensive guide to the physics and usage of PYTHIA 8.3

    Bierlich, Christian and others. A comprehensive guide to the physics and usage of PYTHIA 8.3. SciPost Phys. Codeb. 2022. doi:10.21468/SciPostPhysCodeb.8. arXiv:2203.11601

  12. [12]

    Jason Wei and others , title =. Trans. Mach. Learn. Res. , volume =. 2022 , eprint =

  13. [13]

    NeurIPS 2023

    Are Emergent Abilities of Large Language Models a Mirage? , author=. NeurIPS 2023. 2023 , eprint=

  14. [14]

    2021 , url =

    Mark Chen and others , title =. 2021 , url =. 2107.03374 , timestamp =

  15. [15]

    2023 , eprint=

    News Summarization and Evaluation in the Era of GPT-3 , author=. 2023 , eprint=

  16. [16]

    Frontiers Comput

    Wang, Lei and others , year=. A survey on large language model based autonomous agents , volume=. Frontiers of Computer Science , publisher=. doi:10.1007/s11704-024-40231-1 , number=

  17. [17]

    2026 , month =

    Hightower, Kelsey , title =. 2026 , month =

  18. [18]

    Diefenbacher, Sascha and others , title = ". 2025. arXiv:2509.08535

  19. [19]

    and others

    Moreno, Eric A. and others. AI Agents Can Already Autonomously Perform Experimental High Energy Physics. 2026. arXiv:2603.20179

  20. [20]

    Agentic AI -- Physicist Collaboration in Experimental Particle Physics: A Proof-of-Concept Measurement with LEP Open Data

    Badea, Anthony and others. Agentic AI -- Physicist Collaboration in Experimental Particle Physics: A Proof-of-Concept Measurement with LEP Open Data. 2026. arXiv:2603.05735

  21. [21]

    Dr.Sai: An agentic AI for real-world physics analysis at BESIII

    He, Mingfeng and others. Dr.Sai: An agentic AI for real-world physics analysis at BESIII. 2026. arXiv:2604.22541

  22. [22]

    HEPTAPOD: Orchestrating High Energy Physics Workflows Towards Autonomous Agency

    Menzo, Tony and others. HEPTAPOD: Orchestrating High Energy Physics Workflows Towards Autonomous Agency. FERMILAB-PUB-25-0923-CSAID-ETD-T. 2025. arXiv:2512.15867

  23. [23]

    GRACE: an Agentic AI for Particle Physics Experiment Design and Simulation

    Hill, Justin and Ryoo, Hong Joo. GRACE: an Agentic AI for Particle Physics Experiment Design and Simulation. 2026. arXiv:2602.15039

  24. [24]

    Automating High Energy Physics Data Analysis with LLM-Powered Agents

    Gendreau-Distler, Eli and others. Automating High Energy Physics Data Analysis with LLM-Powered Agents. NeurIPS 2025. 2025. arXiv:2512.07785

  25. [25]

    MadAgents

    Plehn, Tilman and Schiller, Daniel and Schmal, Nikita. MadAgents. 2026. arXiv:2601.21015

  26. [26]

    Qwen2.5-Coder Technical Report

    Hui, Binyuan and others. Qwen2.5-Coder Technical Report. 2024. arXiv:2409.12186

  27. [27]

    Measurements of ZZ and ZZjj jj productions in pp collisions at s =13 TeV with the ATLAS detector

    ATLAS Collaboration. Measurements of ZZ and ZZjj jj productions in pp collisions at s =13 TeV with the ATLAS detector. 2025. arXiv:2511.15569

  28. [28]

    2017 , url =

    Ashish Vaswani and others , title =. 2017 , url =

  29. [29]

    arXiv:2504.00256

    2025. arXiv:2504.00256

  30. [30]

    doi:10.5281/zenodo.20646340 , url =

    Doglioni, Caterina and Gutschow, Christian and Jacques Costa, António and Pilkington, Andrew and Sinha, Sukanya , title = ". doi:10.5281/zenodo.20646340 , url =

  31. [31]

    and Palacios Schweitzer, Sofia and Pang, Ian and Mishra-Sharma, Siddharth and Shih, David

    Faroughy, Darius A. and Palacios Schweitzer, Sofia and Pang, Ian and Mishra-Sharma, Siddharth and Shih, David. Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction. 2026. arXiv:2605.13950