pith. machine review for the scientific record. sign in

arxiv: 2605.13950 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI· hep-ex· hep-ph

Recognition: 2 theorem links

· Lean Theorem

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

Authors on Pith no claims yet

Pith reviewed 2026-05-15 06:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AIhep-exhep-ph
keywords Collider-BenchLLM agentsLHC analysis reproductionAI benchmarkingparticle physicstool-use evaluationscientific workflowreproducibility
0
0 comments X

The pith

No AI agent reliably beats a physicist when reproducing LHC analyses from public papers alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Collider-Bench to test whether language-model agents can turn published LHC analyses into working simulation and selection pipelines using only public papers and open software. Agents receive tasks that require filling gaps in implementation details through physical reasoning and trial-and-error, then submit predicted event yields that are scored with standard histogram metrics. The work also tracks compute cost and uses an LLM judge to flag fabrications or hallucinations in the generated code. Results across a ladder of general coding agents show that none consistently outperform a human physicist working with the same public resources. This benchmark highlights the gap between current agent capabilities and the demands of real experimental particle physics.

Core claim

Collider-Bench evaluates LLM agents on converting published LHC search papers into executable pipelines that produce predicted yields in specified signal regions, using only public information and open tools. The benchmark scores outputs with histogram fidelity metrics and an LLM-based judge for qualitative errors, while also logging compute cost. Across tested agents, average performance does not exceed that of a physicist-in-the-loop baseline that has access to the same public papers and software.

What carries the argument

Collider-Bench, a set of reproduction tasks that require agents to build simulation-and-selection pipelines from published LHC analyses and submit yield predictions scored by histogram metrics.

If this is right

  • Agents must bridge omitted implementation details using domain knowledge rather than direct code copying.
  • Performance is measured both quantitatively via yield histograms and qualitatively via LLM review for hallucinations.
  • A containerized sandbox with event simulation tools is provided for reproducible evaluation.
  • Tasks are drawn from real LHC searches to reflect actual experimental complexity.
  • Computational cost is reported per task to quantify resource demands of agent attempts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If agents improve on these tasks, they could accelerate validation of new analyses within experimental collaborations.
  • The benchmark could be extended to other data-intensive fields that rely on incomplete public documentation.
  • Success would require tighter integration of physics simulators and reasoning modules beyond general coding agents.
  • Persistent gaps may indicate the need for hybrid human-AI workflows rather than fully autonomous reproduction.

Load-bearing premise

Published papers and public software contain enough information for agents to complete faithful reproductions by using physical reasoning and trial-and-error.

What would settle it

An agent that produces yield predictions matching published results within the benchmark's histogram metrics on a majority of tasks while incurring lower or equal compute cost than the physicist baseline.

Figures

Figures reproduced from arXiv: 2605.13950 by Darius A. Faroughy, David Shih, Ian Pang, Siddharth Mishra-Sharma, Sofia Palacios Schweitzer.

Figure 1
Figure 1. Figure 1: Overview of the COLLIDER-BENCH workflow. The agent writes analysis code and gener￾ates events via the CLI tools, which interface with the public simulation stack. Yields are aggregated into a binned histogram and scored against the published reference. An LLM judge, outside the sandbox, audits the agent’s workspace. the published regions. COLLIDER-BENCH focuses on validating the recast pipeline against a s… view at source ↗
Figure 2
Figure 2. Figure 2: (a) The mean relative L 2 distance for each model and task (over 3 independent runs), with lower values indicating better agreement with the hidden reference yields. (b) The Pareto frontier of agent performance for Accτ versus inference cost for a fidelity threshold of τ = 0.33. 4.1 Experimental Setup We evaluate COLLIDER-BENCH using single-session autonomous coding agents. Each run starts from a fresh wor… view at source ↗
Figure 3
Figure 3. Figure 3: Representative simulation-task predictions, for [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Schematic of the event simulation pipeline. A Lagrangian [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: The agent must convert event-selection criteria from the published analysis description [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: fraction pass/fail runs. Claude Code Claude Code Claude Code Codex Codex ForgeCode Collider-Bench Tasks Opus 4.7 Sonnet 4.6 Haiku 4.5 GPT-5.5 GPT-5.4-mini DeepSeek-V4 sus-16-034_sim-TChiWZ 0.19±0.12 0.30±0.14 0.893 0.29±0.08 0.65±0.35 0.80±0.29 sus-16-046_sim-T5Wg 0.51±0.38 0.61±0.46 18.0 0.13±0.08 0.50±0.45 0.99±0.01 sus-16-046_sim-TChiWg 0.49±0.39 0.93±0.05 0.92±0.10 0.28±0.06 0.84±0.23 0.84±0.27 sus-16-… view at source ↗
Figure 7
Figure 7. Figure 7: (a) The mean relative L 2 distance for each model and Shape task (over 3 independent runs), with lower values indicating better agreement with the hidden reference yields. (b) The Pareto frontier of agent performance for Accτ versus inference cost for a fidelity threshold of τ = 0.33. 100 200 300 400 E miss T [GeV] 10¡1 10 0 10 1 Yield sus-16-034_sim-TChiWZ 600 800 1000 1200 1400 1600 S γT [GeV] 10¡3 10¡2 … view at source ↗
Figure 8
Figure 8. Figure 8: Best-of-runs per agent against the published yields (grey) for every [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Best-of-runs per agent against the published shape (grey) for every [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
read the original abstract

Autonomous language-model agents are increasingly evaluated on long-horizon tool-use tasks, but existing benchmarks rarely capture the complexity and nuance of real scientific work. To address this gap, we introduce Collider-Bench, a benchmark for evaluating whether LLM agents can reproduce experimental analyses from the Large Hadron Collider (LHC) using only public papers and open scientific software. Such analyses are often difficult to reproduce because the public toolchain only approximates the software used internally by the experimental collaborations, while the published papers inevitably omit implementation details needed for a faithful reconstruction. Agents must therefore rely on physical reasoning, domain knowledge, and trial-and-error to fill these gaps. Each task requires the agent to turn a published analysis into an executable simulation-and-selection pipeline and submit predicted collision event yields in specified signal regions. These predictions are evaluated with standard histogram metrics that provide continuous fidelity scores without a hand-written rubric. We also report the computational cost incurred by each agent per task. Finally, we evaluate the codebase and full session trace using an LLM judge to catch qualitative failure modes such as fabrications, hallucinations and duplications. We release an initial set of tasks drawn from LHC searches, together with a containerized sandbox and event simulation tools. We evaluate across a capability ladder of general purpose coding agents. Our results show that on average no agent reliably beats the physicist-in-the-loop solution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces Collider-Bench, a benchmark for evaluating LLM agents on reproducing LHC particle physics analyses from public papers and open software. Agents must construct executable simulation-and-selection pipelines and submit predicted event yields in signal regions, scored via standard histogram metrics. The work evaluates a ladder of general-purpose coding agents, reports computational costs per task, uses an LLM judge to detect qualitative failures such as hallucinations, and releases an initial set of tasks together with a containerized sandbox. The central result is that on average no agent reliably outperforms the physicist-in-the-loop baseline.

Significance. If the results hold under fair conditions, Collider-Bench supplies a much-needed benchmark for long-horizon scientific tool use that requires physical reasoning to bridge gaps in public documentation. The release of concrete tasks, the sandbox, and evaluation tooling is a clear strength that supports reproducibility and follow-on work. The findings would usefully document current limitations of general agents relative to expert humans on realistic particle-physics reproduction tasks.

major comments (1)
  1. [Results and Evaluation] The headline claim that no agent reliably beats the physicist-in-the-loop solution is load-bearing for the paper's conclusions, yet the manuscript provides no explicit protocol for the baseline that matches the constraints stated for the agents (public papers only, open software, containerized sandbox, same time budget, no internal notes or non-public detector functions). Without this specification it is impossible to determine whether any performance gap reflects agent reasoning or unequal problem difficulty.
minor comments (1)
  1. [Abstract] The abstract refers to 'an initial set of tasks' without stating the number of analyses or the specific LHC searches involved; adding this information would help readers gauge the benchmark's current scope.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful and constructive review. The single major comment concerns the specification of the physicist-in-the-loop baseline. We address it directly below and will revise the manuscript to incorporate the requested details.

read point-by-point responses
  1. Referee: [Results and Evaluation] The headline claim that no agent reliably beats the physicist-in-the-loop solution is load-bearing for the paper's conclusions, yet the manuscript provides no explicit protocol for the baseline that matches the constraints stated for the agents (public papers only, open software, containerized sandbox, same time budget, no internal notes or non-public detector functions). Without this specification it is impossible to determine whether any performance gap reflects agent reasoning or unequal problem difficulty.

    Authors: We agree that an explicit, matching protocol for the baseline is necessary to support the central claim. In the revised manuscript we will insert a new subsection (under Evaluation) that fully documents the baseline procedure. The physicist performed each task using only the public papers and open software inside the identical containerized sandbox, subject to the same per-task time budget, and without access to internal notes or non-public detector functions. This protocol will be stated in sufficient detail to allow direct comparison with the agent runs. We believe the addition will remove any ambiguity about whether the observed gap arises from unequal conditions. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation or evaluation chain

full rationale

The paper introduces Collider-Bench as an independent benchmark consisting of tasks drawn from published LHC analyses, with agents restricted to public papers and open software in a containerized sandbox. The central empirical claim (no agent reliably beats the physicist-in-the-loop baseline) is obtained by direct execution and scoring via histogram metrics plus LLM judging; these evaluation procedures are defined separately from the outcomes and do not reduce to any fitted parameter, self-citation, or definitional loop. No equations, ansatzes, or uniqueness theorems are invoked that would make the reported result equivalent to its inputs by construction. The benchmark definition and comparison protocol are self-contained against external tasks and standard metrics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The benchmark rests on standard assumptions about availability of public LHC papers and software; no free parameters, axioms beyond domain standards, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5561 in / 983 out tokens · 32594 ms · 2026-05-15T06:00:10.695853+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 14 internal anchors

  1. [1]

    MadAgents

    Plehn, Tilman and Schiller, Daniel and Schmal, Nikita. MadAgents. arXiv:2601.21015. 2026

  2. [2]

    An End-to-end Architecture for Collider Physics and Beyond

    Qiu, Shi and Cai, Zeyu and Wei, Jiashen and Li, Zeyu and Yin, Yixuan and Cao, Qing-Hong and Liu, Chang and Luo, Ming-xing and Yuan, Xing-Bo and Zhu, Hua Xing. An End-to-end Architecture for Collider Physics and Beyond. arXiv:2603.14553. 2026

  3. [3]

    The FERMIACC: Agents for Particle Theory

    Agrawal, Prateek and Craig, Nathaniel and Madden, Amalia and Lombera, I \ n igo Valenzuela. The FERMIACC: Agents for Particle Theory. arXiv:2603.22538. 2026

  4. [4]

    A comprehensive guide to the physics and usage of PYTHIA 8.3

    Bierlich, Christian and others. A comprehensive guide to the physics and usage of PYTHIA 8.3. SciPost Phys. Codeb. 2022. doi:10.21468/SciPostPhysCodeb.8. arXiv:2203.11601

  5. [5]

    DELPHES 3, A modular framework for fast simulation of a generic collider experiment

    de Favereau, J. and Delaere, C. and Demin, P. and Giammanco, A. and Lema \^ tre, V. and Mertens, A. and Selvaggi, M. DELPHES 3, A modular framework for fast simulation of a generic collider experiment. JHEP. 2014. doi:10.1007/JHEP02(2014)057. arXiv:1307.6346

  6. [6]

    Squark and Gluino Production at Hadron Colliders

    Beenakker, W. and Hopker, R. and Spira, M. and Zerwas, P. M. Squark and gluino production at hadron colliders. Nucl. Phys. B. 1997. doi:10.1016/S0550-3213(97)80027-2. arXiv:hep-ph/9610490

  7. [7]

    The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations

    Alwall, J. and Frederix, R. and Frixione, S. and Hirschi, V. and Maltoni, F. and Mattelaer, O. and Shao, H. -S. and Stelzer, T. and Torrielli, P. and Zaro, M. The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations. JHEP. 2014. doi:10.1007/JHEP07(2014)079. arXiv:1405.0301

  8. [8]

    Reinterpretation of LHC Results for New Physics: Status and Recommendations after Run 2

    Abdallah, Waleed and others. Reinterpretation of LHC Results for New Physics: Status and Recommendations after Run 2. SciPost Phys. 2020. doi:10.21468/SciPostPhys.9.2.022. arXiv:2003.07868

  9. [9]

    RECAST: Extending the Impact of Existing Analyses

    Cranmer, Kyle and Yavin, Itay. RECAST: Extending the Impact of Existing Analyses. JHEP. 2011. doi:10.1007/JHEP04(2011)038. arXiv:1010.2506

  10. [10]

    HEPData: a repository for high energy physics data

    Maguire, Eamonn and Heinrich, Lukas and Watt, Graeme. HEPData: a repository for high energy physics data. J. Phys. Conf. Ser. 2017. doi:10.1088/1742-6596/898/10/102006. arXiv:1704.05473

  11. [11]

    Clarification of the Use of Chi Square and Likelihood Functions in Fits to Histograms

    Baker, Steve and Cousins, Robert D. Clarification of the Use of Chi Square and Likelihood Functions in Fits to Histograms. Nucl. Instrum. Meth. 1984. doi:10.1016/0167-5087(84)90016-4

  12. [12]

    IEEE Trans

    Lin, Jianhua , title =. IEEE Trans. Inf. Theory , volume =

  13. [13]

    Nature Chemistry , volume =

    Mirza, Adrian and others , title =. Nature Chemistry , volume =. doi:10.1038/s41557-025-01815-x

  14. [14]

    Laurent and Joseph D

    Jon M. Laurent and Joseph D. Janizek and Michael Ruzo and Michaela M. Hinks and Michael J. Hammerling and Siddharth Narayanan and Manvitha Ponnapati and Andrew D. White and Samuel G. Rodriques , title =. arXiv:2407.10362 , year=

  15. [15]

    arXiv:2601.21165 , institution =

    Miles Wang and Robi Lin and Kat Hu and Joy Jiao and Neil Chowdhury and Ethan Chang and Tejal Patwardhan , title =. arXiv:2601.21165 , institution =

  16. [16]

    Chung, Daniel J. H. and Gao, Zhiqi and Kvasiuk, Yurii and Li, Tianyi and M. Theoretical physics benchmark (TPBench) a dataset and study of AI reasoning capabilities in theoretical physics. Mach. Learn. Sci. Tech. 2025. doi:10.1088/2632-2153/adfcb0. arXiv:2502.15815

  17. [17]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik Narasimhan , title =. arXiv:2310.06770 , year =

  18. [18]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    Mike A. Merrill and others , title =. arXiv:2601.11868 , year=

  19. [19]

    MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

    Chan, Jun Shern and others , title =. arXiv:2410.07095 , year=

  20. [20]

    arXiv:2310.03302 , year=

    Qian Huang and Jian Vora and Percy Liang and Jure Leskovec , title =. arXiv:2310.03302 , year=

  21. [21]

    arXiv:2411.15114 , year=

    Wijk, Hjalmar and others , title =. arXiv:2411.15114 , year=

  22. [22]

    ICLR , year =

    Chen, Ziru and others , title =. ICLR , year =

  23. [23]

    Siegel and Sayash Kapoor and Nitya Nagdir and Benedikt Stroebl and Arvind Narayanan , title =

    Zachary S. Siegel and Sayash Kapoor and Nitya Nagdir and Benedikt Stroebl and Arvind Narayanan , title =. arXiv:2409.11363 , year =

  24. [24]

    arXiv:2504.01848 , year =

    Giulio Starace and Oliver Jaffe and Dane Sherburn and James Aung and Jun Shern Chan and Leon Maksin and Rachel Dias and Evan Mays and Benjamin Kinsella and Wyatt Thompson and Johannes Heidecke and Amelia Glaese and Tejal Patwardhan , title =. arXiv:2504.01848 , year =

  25. [25]

    arXiv:2503.00096 , year =

    Ludovico Mitchener and Jon M Laurent and Alex Andonian and Benjamin Tenmann and Siddharth Narayanan and Geemi P Wellawatte and Andrew White and Lorenzo Sani and Samuel G Rodriques , title =. arXiv:2503.00096 , year =

  26. [26]

    and Greenig, Matthew and Tenmann, Benjamin and Wang, Bo , title =

    Miller, Henry E. and Greenig, Matthew and Tenmann, Benjamin and Wang, Bo , title =. bioRxiv , year =. doi:10.1101/2025.09.01.673319 , publisher =

  27. [27]

    Nguyen and Baixuan Xu and Zhaowei Wang and Jiayang Cheng and Hong Ting Tsang and Weiqi Wang and Jiaxin Bai and Tianqing Fang and Yangqiu Song and Ginny Y

    Tianshi Zheng and Kelvin Kiu-Wai Tam and Newt Hue-Nam K. Nguyen and Baixuan Xu and Zhaowei Wang and Jiayang Cheng and Hong Ting Tsang and Weiqi Wang and Jiaxin Bai and Tianqing Fang and Yangqiu Song and Ginny Y. Wong and Simon See , title =. 2025 , journal =

  28. [28]

    2025 , journal =

    Nolan Koblischke and Hyunseok Jang and Kristen Menou and Mohamad Ali-Dib , title =. 2025 , journal =

  29. [29]

    Christine Ye and Sihan Yuan and Suchetha Cooray and Steven Dillmann and Ian L. V. Roque and Dalya Baron and Philipp Frank and Sergio Martin-Alvarez and Nolan Koblischke and Frank J Qu and Diyi Yang and Risa Wechsler and Ioana Ciuca , title =. arXiv:2510.24591 , year =

  30. [30]

    arXiv:2503.13517 , year =

    Cui, Hao and others , title =. arXiv:2503.13517 , year =

  31. [31]

    arXiv:2407.13168 , year=

    Tian, Minyang and others , title =. arXiv:2407.13168 , year=. 2407.13168 , archivePrefix=

  32. [32]

    arXiv:2512.07785 , year =

    Gendreau-Distler, Etienne and others , title =. arXiv:2512.07785 , year =

  33. [33]

    Menzo, Tony and Roman, Alexander and Gleyzer, Sergei and Matchev, Konstantin and Fleming, George T. and H. HEPTAPOD: Orchestrating High Energy Physics Workflows Towards Autonomous Agency , journal =

  34. [34]

    and Bright-Thonney, Samuel and Novak, Andrzej and Garcia, Dolores and Harris, Philip

    Moreno, Eric A. and Bright-Thonney, Samuel and Novak, Andrzej and Garcia, Dolores and Harris, Philip. AI Agents Can Already Autonomously Perform Experimental High Energy Physics , journal =

  35. [35]

    Advances in neural information processing systems , volume=

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena , author=. Advances in neural information processing systems , volume=

  36. [36]

    arXiv:2507.02825 , year =

    Zhu, Yuxuan and others , title =. arXiv:2507.02825 , year =

  37. [37]

    , title =

    Tailcall, Inc. , title =. 2026 , howpublished =

  38. [38]

    Search for new phenomena in final states with two opposite-charge, same-flavor leptons, jets, and missing transverse momentum in pp collisions at $\sqrt{s} = $ 13 TeV

    Sirunyan, Albert M and others. Search for new phenomena in final states with two opposite-charge, same-flavor leptons, jets, and missing transverse momentum in pp collisions at s =13 TeV. JHEP. 2018. doi:10.1007/s13130-018-7845-2. arXiv:1709.08908

  39. [39]

    Search for gauge-mediated supersymmetry in events with at least one photon and missing transverse momentum in pp collisions at $\sqrt{s} = $ 13 TeV

    Sirunyan, Albert M and others. Search for gauge-mediated supersymmetry in events with at least one photon and missing transverse momentum in pp collisions at s = 13 TeV. Phys. Lett. B. 2018. doi:10.1016/j.physletb.2018.02.045. arXiv:1711.08008

  40. [40]

    Search for supersymmetry in events with at least one photon, missing transverse momentum, and large transverse event activity in proton-proton collisions at sqrt(s) = 13 TeV

    Sirunyan, Albert M and others. Search for supersymmetry in events with at least one photon, missing transverse momentum, and large transverse event activity in proton-proton collisions at s =13 TeV. JHEP. 2017. doi:10.1007/JHEP12(2017)142. arXiv:1707.06193

  41. [41]

    Search for top squark pair production in pp collisions at sqrt(s) = 13 TeV using single lepton events

    Sirunyan, Albert M and others. Search for top squark pair production in pp collisions at s =13 TeV using single lepton events. JHEP. 2017. doi:10.1007/JHEP10(2017)019. arXiv:1706.04402

  42. [42]

    2026 , howpublished =

    Anthropic , title =. 2026 , howpublished =

  43. [43]

    2026 , howpublished =

    OpenAI , title =. 2026 , howpublished =

  44. [44]

    2026 , howpublished =

    DeepSeek-AI , title =. 2026 , howpublished =