MadAgents

Daniel Schiller; Nikita Schmal; Tilman Plehn

arXiv: 2601.21015 · v3 · submitted 2026-01-28 · ✦ hep-ph

Pith reviewed 2026-05-16 10:16 UTC · model grok-4.3

keywords AI agents · MadGraph · particle physics simulations · LHC · autonomous workflows · event generation

The pith

AI agents install, support and run full MadGraph simulation campaigns starting from a research paper PDF.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MadAgents as a set of AI agents that work directly with MadGraph to handle software installation, provide hands-on training, offer user support, and execute complete event-generation campaigns. The agents accept a PDF of a physics paper as input and autonomously set up and run the required simulations while analyzing results for both new and experienced users. A self-improvement loop in the updated implementation allows the agents to refine their performance over repeated tasks. If the agents perform reliably, the approach would reduce the technical overhead for LHC-related simulations and let researchers focus more on physics interpretation rather than setup details.

Core claim

MadAgents deliver agentic installation, learning-by-doing training, user support, and autonomous simulation campaigns for MadGraph, beginning directly from a PDF file of a paper, using an updated Claude Code implementation that includes a self-improvement loop.

What carries the argument

The set of communicative AI agents that interact with MadGraph's command-line interface to interpret papers, generate simulation setups, run event generation, and analyze outputs.
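As a rough illustration of what "generating a simulation setup" for the command-line interface could look like, the sketch below assembles a short MadGraph command script as plain text. It is not the authors' implementation: the helper name `build_mg5_script` and the example process, directory, and event count are hypothetical, and the command lines follow the usual MadGraph5_aMC@NLO script pattern, which should be checked against the installed version.

```python
from typing import List


def build_mg5_script(process: str, output_dir: str, n_events: int,
                     model: str = "sm") -> str:
    """Return the text of a MadGraph command script for a single process.

    Hypothetical helper for illustration only; the paper does not
    describe how MadAgents assemble their commands.
    """
    if n_events <= 0:
        raise ValueError("n_events must be positive")
    lines: List[str] = [
        f"import model {model}",     # load the physics model
        f"generate {process}",       # define the hard process
        f"output {output_dir}",      # write the process directory
        f"launch {output_dir}",      # start event generation
        f"set nevents {n_events}",   # run-card setting applied at launch
    ]
    return "\n".join(lines) + "\n"


# Example values are assumptions chosen for illustration.
script = build_mg5_script("p p > t t~", "ttbar_run", 10000)
print(script)
```

Keeping the script as text makes each step an agent takes easy to log and review before anything is executed.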

If this is right

Inexperienced users gain direct access to state-of-the-art MadGraph simulations through guided installation and training.
Simulation campaigns can be launched and completed autonomously once a paper PDF is provided.
Agents support a range of tasks including result analysis for both basic and advanced users.
The self-improvement loop enables the system to refine its performance on repeated simulation requests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same agent pattern could be tested on other command-line physics packages to check transferability.
Integration with version control or public repositories might allow agents to pull papers and reproduce results at scale.
Error rates in autonomous runs could be measured across a set of recent LHC papers to quantify reliability gains over time.
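The last suggestion can be made concrete with a small sketch: given a pass/fail outcome per paper, report the success rate together with a Wilson score interval, so that small samples are not over-interpreted. The outcome counts below are invented placeholders, not results from the paper, and `wilson_interval` is a name introduced here for illustration.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int,
                    z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Placeholder outcomes: True means the autonomous run needed no human fix.
# These values are made up purely to exercise the function.
outcomes = [True] * 15 + [False] * 5
n_ok = sum(outcomes)
low, high = wilson_interval(n_ok, len(outcomes))
print(f"success rate {n_ok / len(outcomes):.2f}, "
      f"95% interval [{low:.2f}, {high:.2f}]")
```

With only a few dozen papers the interval stays wide, which is why a per-paper failure-mode breakdown is a useful complement to the headline rate.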

Load-bearing premise

The underlying AI model can reliably interpret physics papers and interact with MadGraph's command-line interface to produce correct simulation setups without frequent errors or human fixes.

What would settle it

A controlled test in which the agents receive a specific paper PDF, generate and execute the full simulation campaign end-to-end, and produce output files whose key distributions match the paper's published results without manual intervention.
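One way such a comparison might be scored is a reduced chi-square between the agent-produced histogram and the published one. The sketch below assumes identical binning, uses only the reference uncertainties, and ignores bin-to-bin correlations; the agreement threshold is an arbitrary assumption, and all numbers are invented for illustration.

```python
from typing import Sequence


def chi2_per_bin(observed: Sequence[float], reference: Sequence[float],
                 ref_uncertainty: Sequence[float]) -> float:
    """Chi-square per bin between two histograms with identical binning."""
    if not (len(observed) == len(reference) == len(ref_uncertainty)):
        raise ValueError("all inputs must have the same number of bins")
    if len(observed) == 0:
        raise ValueError("at least one bin is required")
    if any(s <= 0 for s in ref_uncertainty):
        raise ValueError("uncertainties must be positive")
    chi2 = sum(((o - r) / s) ** 2
               for o, r, s in zip(observed, reference, ref_uncertainty))
    return chi2 / len(observed)


def distributions_agree(observed: Sequence[float], reference: Sequence[float],
                        ref_uncertainty: Sequence[float],
                        threshold: float = 2.0) -> bool:
    """Crude pass/fail decision; the threshold is an assumed convention."""
    return chi2_per_bin(observed, reference, ref_uncertainty) <= threshold


# Invented bin contents standing in for a digitised published distribution.
published = [10.0, 20.0, 15.0]
published_err = [1.0, 2.0, 1.5]
agent_run = [11.0, 19.0, 15.0]
print(distributions_agree(agent_run, published, published_err))
```

A real test would also fold in the Monte Carlo statistical uncertainty of the agent's run and check the total cross section separately.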

Original abstract

We uncover an effective and communicative set of agents working with MadGraph. Agentic installation, learning-by-doing training, and user support provide easy access to state-of-the-art simulations and accelerate LHC research. We show in detail how MadAgents interact with inexperienced and advanced users, support a range of simulation tasks, and analyze results. In a second step, we illustrate how MadAgents automatize event generation and run an autonomous simulation campaign, starting from a pdf file of a paper. The updated Claude Code implementation includes a self-improvement loop.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MadAgents shows AI agents automating MadGraph installation, support, and PDF-to-simulation workflows with a self-improvement loop, but the description stays qualitative with no error rates or benchmark comparisons.

The core of this paper is a practical set of agents that install MadGraph, train users through hands-on tasks, answer their questions, and then run full event-generation campaigns starting directly from a paper PDF. The self-improvement loop in the updated Claude Code version is the clearest addition over basic agent setups. The authors walk through concrete interactions with both new and experienced users and show the agents handling process definitions, parameter cards, and result analysis without constant human prompting. That end-to-end flow from PDF to output is the part that could actually change daily work for some LHC groups. The examples are clear enough to see how the agents communicate and correct course during a run, and the paper stays focused on implementation details rather than claiming new physics.

The main gap is the missing validation. There are no reported success rates across a set of papers, no tests on edge cases such as ambiguous cuts or multi-process requests, and no side-by-side comparison with expert human setups. Without those numbers it is hard to know how often small interpretation errors would produce wrong cross sections or invalid cards. The claims rest on selected successful traces rather than systematic checks.

This is useful for readers who already run MadGraph and want to reduce setup time, and for groups exploring AI tooling in phenomenology. A referee could ask for quantitative benchmarks on a few dozen papers and a failure-mode analysis, but the work is concrete enough to deserve that review rather than a desk rejection.

Referee Report

3 major / 2 minor

Summary. The paper presents MadAgents, a system of AI agents integrated with MadGraph that enable agentic installation, learning-by-doing training, user support for both novice and advanced users, and fully autonomous simulation campaigns that start from a PDF of a physics paper, with an added self-improvement loop in the updated Claude Code implementation.

Significance. If validated, the approach could meaningfully lower barriers to LHC phenomenology by automating routine MadGraph workflows and allowing non-experts to launch campaigns directly from the literature. The combination of interactive support and autonomous PDF-to-event-generation pipelines represents a novel application of agentic AI in high-energy physics tooling.

major comments (3)

[Autonomous simulation campaign and self-improvement loop sections] The central claim of reliable autonomous simulation campaigns (starting from an arbitrary PDF) is unsupported by any quantitative metrics: no success rates, error frequencies, failure-mode analysis, or side-by-side comparison against expert human setups are provided anywhere in the manuscript.
[Agentic installation, learning-by-doing training, and user support sections] The description of agent interactions with the MadGraph CLI (installation, process definition, parameter-card generation, and result analysis) remains purely qualitative; no test cases, edge-case handling, or validation against known MadGraph outputs are reported, leaving the reliability of the core workflow unverified.
[Updated Claude Code implementation and self-improvement loop] The self-improvement loop is asserted to enhance performance, yet the manuscript supplies neither concrete examples of corrections made by the loop nor any before/after performance indicators, making it impossible to assess whether the loop actually reduces errors.

minor comments (2)

Figure captions and workflow diagrams would benefit from explicit labels indicating which agent performs each step and which MadGraph commands are invoked.
The manuscript should include a short table summarizing the distinct agent roles and their primary responsibilities for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments on the MadAgents manuscript. We agree that quantitative validation is necessary to support the claims of reliability and will revise the manuscript to incorporate the requested metrics, test cases, and examples.

Referee: [Autonomous simulation campaign and self-improvement loop sections] The central claim of reliable autonomous simulation campaigns (starting from an arbitrary PDF) is unsupported by any quantitative metrics: no success rates, error frequencies, failure-mode analysis, or side-by-side comparison against expert human setups are provided anywhere in the manuscript.

Authors: We acknowledge that the current manuscript presents the autonomous PDF-to-campaign workflow primarily through illustrative examples rather than statistical validation. In the revised version we will add a dedicated evaluation section reporting success rates across 20 sample PDFs drawn from recent LHC phenomenology papers, a breakdown of observed failure modes (PDF parsing ambiguities, command syntax errors, and parameter inconsistencies), and a side-by-side comparison of agent-generated setups versus expert manual configurations for a subset of processes, including wall-clock time and final cross-section agreement.

revision: yes

Referee: [Agentic installation, learning-by-doing training, and user support sections] The description of agent interactions with the MadGraph CLI (installation, process definition, parameter-card generation, and result analysis) remains purely qualitative; no test cases, edge-case handling, or validation against known MadGraph outputs are reported, leaving the reliability of the core workflow unverified.

Authors: We agree that the core agent–MadGraph interactions require explicit validation. The revised manuscript will include a new appendix with concrete test cases: a complete installation trace, an example process-definition dialogue with the agent’s reasoning steps, generated parameter cards compared line-by-line to manually produced cards, and result-analysis outputs cross-checked against standard MadGraph reference outputs for benchmark processes such as pp → tt̄ and pp → WZ.

revision: yes

Referee: [Updated Claude Code implementation and self-improvement loop] The self-improvement loop is asserted to enhance performance, yet the manuscript supplies neither concrete examples of corrections made by the loop nor any before/after performance indicators, making it impossible to assess whether the loop actually reduces errors.

Authors: The self-improvement loop is a recent addition to the Claude Code implementation. The revised text will supply two concrete examples of corrections performed by the loop (one involving recovery from an incorrect PDF-extracted parameter value and one fixing a mis-specified decay chain) together with before-and-after error counts and success-rate improvements measured over a fixed set of ten autonomous campaigns.

revision: yes
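The line-by-line card comparison promised in the second response could be automated along the following lines. This is a minimal sketch, assuming cards in the SLHA-like layout MadGraph uses for `param_card.dat` (`BLOCK` headers followed by index/value lines, plus `DECAY` lines); the parser is deliberately simplified, the function names are hypothetical, and the card contents are invented.

```python
import math
from typing import Dict, Optional, Tuple

Key = Tuple[str, str]


def parse_card(text: str) -> Dict[Key, float]:
    """Parse BLOCK entries and DECAY widths from a simplified SLHA-like card."""
    values: Dict[Key, float] = {}
    block = ""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop trailing comments
        if not line:
            continue
        fields = line.split()
        head = fields[0].lower()
        if head == "block":
            block = fields[1].lower() if len(fields) > 1 else ""
        elif head == "decay":
            if len(fields) >= 3:
                values[("decay", fields[1])] = float(fields[2])
            block = ""   # skip any branching-ratio lines that follow
        elif block and len(fields) >= 2:
            try:
                values[(block, " ".join(fields[:-1]))] = float(fields[-1])
            except ValueError:
                continue   # non-numeric entry, ignored in this sketch
    return values


def compare_cards(card_a: str, card_b: str, rel_tol: float = 1e-6
                  ) -> Dict[Key, Tuple[Optional[float], Optional[float]]]:
    """Return entries that are missing from one card or differ numerically."""
    a, b = parse_card(card_a), parse_card(card_b)
    diffs: Dict[Key, Tuple[Optional[float], Optional[float]]] = {}
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key), b.get(key)
        if va is None or vb is None or not math.isclose(va, vb, rel_tol=rel_tol):
            diffs[key] = (va, vb)
    return diffs


# Invented card fragments: the "agent" card carries a different top mass.
expert_card = """BLOCK MASS
    6 1.730000e+02 # MT
   25 1.250000e+02 # MH
DECAY 6 1.491500e+00
"""
agent_card = expert_card.replace("1.730000e+02", "1.725000e+02")
differences = compare_cards(expert_card, agent_card)
print(differences)
```

Comparing parsed values rather than raw text avoids flagging harmless differences in whitespace or comments.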

Circularity Check

0 steps flagged

No circularity: descriptive software account with no derivations or fits

The paper is a purely descriptive account of an AI agent system for MadGraph, covering installation, training, user support, autonomous campaigns from PDF inputs, and a self-improvement loop. It contains no mathematical derivations, equations, fitted parameters, predictions, or uniqueness theorems. All content consists of workflow descriptions and qualitative examples with no load-bearing steps that could reduce outputs to inputs by construction or via self-citation chains. The central claims rest on demonstrated functionality rather than any self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No theoretical content; the work is a software description with no free parameters, axioms, or invented physical entities.

pith-pipeline@v0.9.0 · 5369 in / 1085 out tokens · 22930 ms · 2026-05-16T10:16:29.473248+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction
cs.LG · 2026-05 · unverdicted · novelty 7.0

Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.

RooAgent: An LLM Agent for Root-Based High Energy Physics Analysis
hep-ph · 2026-05 · unverdicted · novelty 6.0

RooAgent provides an LLM agent interface that translates natural-language prompts into calls to PyROOT analysis functions for high energy physics tasks, with support for multiple AI backends and tested on ZH simulatio...