Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-15 06:00 UTC · model grok-4.3
The pith
No AI agent reliably beats a physicist when reproducing LHC analyses from public papers alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Collider-Bench evaluates LLM agents on converting published LHC search papers into executable pipelines that produce predicted yields in specified signal regions, using only public information and open tools. The benchmark scores outputs with histogram fidelity metrics and an LLM-based judge for qualitative errors, while also logging compute cost. Across tested agents, average performance does not exceed that of a physicist-in-the-loop baseline that has access to the same public papers and software.
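The paper does not spell out its histogram fidelity metric here, but the idea of a continuous score over predicted versus published signal-region yields can be sketched. The function below is a minimal illustration, not Collider-Bench's actual metric: `yield_fidelity`, the per-bin uncertainty proxy, and the exponential mapping are all assumptions made for this example.

```python
import math

def yield_fidelity(predicted, published, rel_floor=0.1):
    """Continuous fidelity score in (0, 1] for predicted signal-region yields.

    1.0 means a perfect match; the score decays with the summed squared
    relative error across regions. `rel_floor` sets a crude per-bin
    uncertainty proxy so near-empty regions do not divide by zero.
    Illustrative only; not the benchmark's published metric.
    """
    chi2 = 0.0
    for pred, pub in zip(predicted, published):
        scale = max(abs(pub) * rel_floor, 1.0)
        chi2 += ((pred - pub) / scale) ** 2
    # Map mean chi-square per region to a bounded score.
    return math.exp(-chi2 / max(len(published), 1))

# A prediction close to the published yields scores near 1.
score = yield_fidelity([102.0, 48.5], [100.0, 50.0])
```

A continuous score like this is what lets the benchmark rank partial successes without a hand-written rubric: a pipeline that is close in every region outscores one that nails some regions and badly misses others.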
What carries the argument
Collider-Bench, a set of reproduction tasks that require agents to build simulation-and-selection pipelines from published LHC analyses and submit yield predictions scored by histogram metrics.
If this is right
- Agents must bridge omitted implementation details using domain knowledge rather than direct code copying.
- Performance is measured both quantitatively via yield histograms and qualitatively via LLM review for hallucinations.
- A containerized sandbox with event simulation tools is provided for reproducible evaluation.
- Tasks are drawn from real LHC searches to reflect actual experimental complexity.
- Computational cost is reported per task to quantify resource demands of agent attempts.
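The bullets above describe three parallel measurements per task: a quantitative histogram score, qualitative flags from an LLM judge, and logged compute cost. A minimal sketch of how one task's record might combine them follows; the field names, threshold, and pass rule are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class TaskResult:
    """One agent attempt at one reproduction task (illustrative schema)."""
    task_id: str                  # which published LHC search was attempted
    histogram_score: float        # continuous fidelity in [0, 1]
    judge_flags: list = field(default_factory=list)  # e.g. ["hallucination"]
    compute_cost_usd: float = 0.0  # logged per task

    def passed(self, threshold: float = 0.8) -> bool:
        # A run counts as a success only if the yields match quantitatively
        # AND the LLM judge found no qualitative failure such as a
        # fabricated cutflow or duplicated events.
        return self.histogram_score >= threshold and not self.judge_flags

r = TaskResult("cms-sus-example", histogram_score=0.91, compute_cost_usd=4.20)
```

The key design point this captures is that the two evaluation channels are conjunctive: a high histogram score cannot rescue a run the judge flagged as fabricated.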
Where Pith is reading between the lines
- If agents improve on these tasks, they could accelerate validation of new analyses within experimental collaborations.
- The benchmark could be extended to other data-intensive fields that rely on incomplete public documentation.
- Success would require tighter integration of physics simulators and reasoning modules beyond general coding agents.
- Persistent gaps may indicate the need for hybrid human-AI workflows rather than fully autonomous reproduction.
Load-bearing premise
Published papers and public software contain enough information for agents to complete faithful reproductions by using physical reasoning and trial-and-error.
What would settle it
An agent that produces yield predictions matching published results within the benchmark's histogram metrics on a majority of tasks, while incurring compute cost no greater than the physicist baseline's.
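The settling criterion above has two conjoined parts: a majority of tasks passing the histogram metric, and total compute cost at or below the baseline's. A small sketch makes the logic explicit; the pass threshold and all names here are assumptions for illustration, not values from the paper.

```python
def settles_it(agent_scores, agent_costs, baseline_cost, threshold=0.8):
    """Return True if the agent meets the review's settling criterion:
    it matches published results (per-task score >= threshold) on a strict
    majority of tasks AND its total compute cost does not exceed the
    physicist baseline's. Threshold is an illustrative assumption.
    """
    passed = sum(1 for s in agent_scores if s >= threshold)
    majority = passed > len(agent_scores) / 2
    cheaper_or_equal = sum(agent_costs) <= baseline_cost
    return majority and cheaper_or_equal

# Two of three tasks pass and total cost (11.5) is under the baseline (12.0).
ok = settles_it([0.9, 0.85, 0.4], [3.0, 2.5, 6.0], baseline_cost=12.0)
```

Note that either condition alone is insufficient: an agent that passes most tasks at runaway cost, or a cheap agent that mostly fails, would not settle the question.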
Original abstract
Autonomous language-model agents are increasingly evaluated on long-horizon tool-use tasks, but existing benchmarks rarely capture the complexity and nuance of real scientific work. To address this gap, we introduce Collider-Bench, a benchmark for evaluating whether LLM agents can reproduce experimental analyses from the Large Hadron Collider (LHC) using only public papers and open scientific software. Such analyses are often difficult to reproduce because the public toolchain only approximates the software used internally by the experimental collaborations, while the published papers inevitably omit implementation details needed for a faithful reconstruction. Agents must therefore rely on physical reasoning, domain knowledge, and trial-and-error to fill these gaps. Each task requires the agent to turn a published analysis into an executable simulation-and-selection pipeline and submit predicted collision event yields in specified signal regions. These predictions are evaluated with standard histogram metrics that provide continuous fidelity scores without a hand-written rubric. We also report the computational cost incurred by each agent per task. Finally, we evaluate the codebase and full session trace using an LLM judge to catch qualitative failure modes such as fabrications, hallucinations and duplications. We release an initial set of tasks drawn from LHC searches, together with a containerized sandbox and event simulation tools. We evaluate across a capability ladder of general purpose coding agents. Our results show that on average no agent reliably beats the physicist-in-the-loop solution.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Collider-Bench, a benchmark for evaluating LLM agents on reproducing LHC particle physics analyses from public papers and open software. Agents must construct executable simulation-and-selection pipelines and submit predicted event yields in signal regions, scored via standard histogram metrics. The work evaluates a ladder of general-purpose coding agents, reports computational costs per task, uses an LLM judge to detect qualitative failures such as hallucinations, and releases an initial set of tasks together with a containerized sandbox. The central result is that on average no agent reliably outperforms the physicist-in-the-loop baseline.
Significance. If the results hold under fair conditions, Collider-Bench supplies a much-needed benchmark for long-horizon scientific tool use that requires physical reasoning to bridge gaps in public documentation. The release of concrete tasks, the sandbox, and evaluation tooling is a clear strength that supports reproducibility and follow-on work. The findings would usefully document current limitations of general agents relative to expert humans on realistic particle-physics reproduction tasks.
Major comments (1)
- [Results and Evaluation] The headline claim that no agent reliably beats the physicist-in-the-loop solution is load-bearing for the paper's conclusions, yet the manuscript provides no explicit protocol for the baseline that matches the constraints stated for the agents (public papers only, open software, containerized sandbox, same time budget, no internal notes or non-public detector functions). Without this specification it is impossible to determine whether any performance gap reflects agent reasoning or unequal problem difficulty.
Minor comments (1)
- [Abstract] The abstract refers to 'an initial set of tasks' without stating the number of analyses or the specific LHC searches involved; adding this information would help readers gauge the benchmark's current scope.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The single major comment concerns the specification of the physicist-in-the-loop baseline. We address it directly below and will revise the manuscript to incorporate the requested details.
Point-by-point responses
- Referee: [Results and Evaluation] The headline claim that no agent reliably beats the physicist-in-the-loop solution is load-bearing for the paper's conclusions, yet the manuscript provides no explicit protocol for the baseline that matches the constraints stated for the agents (public papers only, open software, containerized sandbox, same time budget, no internal notes or non-public detector functions). Without this specification it is impossible to determine whether any performance gap reflects agent reasoning or unequal problem difficulty.
  Authors: We agree that an explicit, matching protocol for the baseline is necessary to support the central claim. In the revised manuscript we will insert a new subsection (under Evaluation) that fully documents the baseline procedure. The physicist performed each task using only the public papers and open software inside the identical containerized sandbox, subject to the same per-task time budget, and without access to internal notes or non-public detector functions. This protocol will be stated in sufficient detail to allow direct comparison with the agent runs. We believe the addition will remove any ambiguity about whether the observed gap arises from unequal conditions. Revision: yes.
Circularity Check
No significant circularity in derivation or evaluation chain
Full rationale
The paper introduces Collider-Bench as an independent benchmark consisting of tasks drawn from published LHC analyses, with agents restricted to public papers and open software in a containerized sandbox. The central empirical claim (no agent reliably beats the physicist-in-the-loop baseline) is obtained by direct execution and scoring via histogram metrics plus LLM judging; these evaluation procedures are defined separately from the outcomes and do not reduce to any fitted parameter, self-citation, or definitional loop. No equations, ansatzes, or uniqueness theorems are invoked that would make the reported result equivalent to its inputs by construction. The benchmark definition and comparison protocol are built from external tasks and standard metrics, so they stand independently of the results they evaluate.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "Each task requires the agent to turn a published analysis into an executable simulation-and-selection pipeline and submit predicted collision event yields in specified signal regions. These predictions are evaluated with standard histogram metrics..."
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  The relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "We evaluate across a capability ladder of general purpose coding agents. Our results show that on average no agent reliably beats the physicist-in-the-loop solution."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Plehn, Tilman and Schiller, Daniel and Schmal, Nikita. MadAgents. arXiv:2601.21015. 2026.
- [2] Qiu, Shi and Cai, Zeyu and Wei, Jiashen and Li, Zeyu and Yin, Yixuan and Cao, Qing-Hong and Liu, Chang and Luo, Ming-xing and Yuan, Xing-Bo and Zhu, Hua Xing. An End-to-end Architecture for Collider Physics and Beyond. arXiv:2603.14553. 2026.
- [3] Agrawal, Prateek and Craig, Nathaniel and Madden, Amalia and Lombera, Iñigo Valenzuela. The FERMIACC: Agents for Particle Theory. arXiv:2603.22538. 2026.
- [4] Bierlich, Christian and others. A comprehensive guide to the physics and usage of PYTHIA 8.3. SciPost Phys. Codeb. 2022. doi:10.21468/SciPostPhysCodeb.8. arXiv:2203.11601.
- [5] de Favereau, J. and Delaere, C. and Demin, P. and Giammanco, A. and Lemaître, V. and Mertens, A. and Selvaggi, M. DELPHES 3, A modular framework for fast simulation of a generic collider experiment. JHEP. 2014. doi:10.1007/JHEP02(2014)057. arXiv:1307.6346.
- [6] Beenakker, W. and Hopker, R. and Spira, M. and Zerwas, P. M. Squark and gluino production at hadron colliders. Nucl. Phys. B. 1997. doi:10.1016/S0550-3213(97)80027-2. arXiv:hep-ph/9610490.
- [7] Alwall, J. and Frederix, R. and Frixione, S. and Hirschi, V. and Maltoni, F. and Mattelaer, O. and Shao, H.-S. and Stelzer, T. and Torrielli, P. and Zaro, M. The automated computation of tree-level and next-to-leading order differential cross sections, and their matching to parton shower simulations. JHEP. 2014. doi:10.1007/JHEP07(2014)079. arXiv:1405.0301.
- [8] Abdallah, Waleed and others. Reinterpretation of LHC Results for New Physics: Status and Recommendations after Run 2. SciPost Phys. 2020. doi:10.21468/SciPostPhys.9.2.022. arXiv:2003.07868.
- [9] Cranmer, Kyle and Yavin, Itay. RECAST: Extending the Impact of Existing Analyses. JHEP. 2011. doi:10.1007/JHEP04(2011)038. arXiv:1010.2506.
- [10] Maguire, Eamonn and Heinrich, Lukas and Watt, Graeme. HEPData: a repository for high energy physics data. J. Phys. Conf. Ser. 2017. doi:10.1088/1742-6596/898/10/102006. arXiv:1704.05473.
- [11] Baker, Steve and Cousins, Robert D. Clarification of the Use of Chi Square and Likelihood Functions in Fits to Histograms. Nucl. Instrum. Meth. 1984. doi:10.1016/0167-5087(84)90016-4.
- [12]
- [13] Mirza, Adrian and others. Nature Chemistry. doi:10.1038/s41557-025-01815-x.
- [14] Laurent, Jon M. and Janizek, Joseph D. and Ruzo, Michael and Hinks, Michaela M. and Hammerling, Michael J. and Narayanan, Siddharth and Ponnapati, Manvitha and White, Andrew D. and Rodriques, Samuel G. arXiv:2407.10362.
- [15] Wang, Miles and Lin, Robi and Hu, Kat and Jiao, Joy and Chowdhury, Neil and Chang, Ethan and Patwardhan, Tejal. arXiv:2601.21165.
- [16] Chung, Daniel J. H. and Gao, Zhiqi and Kvasiuk, Yurii and Li, Tianyi and M. Theoretical physics benchmark (TPBench): a dataset and study of AI reasoning capabilities in theoretical physics. Mach. Learn. Sci. Tech. 2025. doi:10.1088/2632-2153/adfcb0. arXiv:2502.15815.
- [17] Jimenez, Carlos E. and Yang, John and Wettig, Alexander and Yao, Shunyu and Pei, Kexin and Press, Ofir and Narasimhan, Karthik. SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770.
- [18] Merrill, Mike A. and others. Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces. arXiv:2601.11868.
- [19] Chan, Jun Shern and others. MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering. arXiv:2410.07095.
- [20] Huang, Qian and Vora, Jian and Liang, Percy and Leskovec, Jure. arXiv:2310.03302.
- [21] Wijk, Hjalmar and others. arXiv:2411.15114.
- [22]
- [23] Siegel, Zachary S. and Kapoor, Sayash and Nagdir, Nitya and Stroebl, Benedikt and Narayanan, Arvind. arXiv:2409.11363.
- [24] Starace, Giulio and Jaffe, Oliver and Sherburn, Dane and Aung, James and Chan, Jun Shern and Maksin, Leon and Dias, Rachel and Mays, Evan and Kinsella, Benjamin and Thompson, Wyatt and Heidecke, Johannes and Glaese, Amelia and Patwardhan, Tejal. arXiv:2504.01848.
- [25] Mitchener, Ludovico and Laurent, Jon M. and Andonian, Alex and Tenmann, Benjamin and Narayanan, Siddharth and Wellawatte, Geemi P. and White, Andrew and Sani, Lorenzo and Rodriques, Samuel G. arXiv:2503.00096.
- [26] Miller, Henry E. and Greenig, Matthew and Tenmann, Benjamin and Wang, Bo. bioRxiv. doi:10.1101/2025.09.01.673319.
- [27] Zheng, Tianshi and Tam, Kelvin Kiu-Wai and Nguyen, Newt Hue-Nam K. and Xu, Baixuan and Wang, Zhaowei and Cheng, Jiayang and Tsang, Hong Ting and Wang, Weiqi and Bai, Jiaxin and Fang, Tianqing and Song, Yangqiu and Wong, Ginny Y. and See, Simon. 2025.
- [28] Koblischke, Nolan and Jang, Hyunseok and Menou, Kristen and Ali-Dib, Mohamad. 2025.
- [29]
- [30]
- [31] Tian, Minyang and others. arXiv:2407.13168.
- [32] Gendreau-Distler, Etienne and others. arXiv:2512.07785.
- [33] Menzo, Tony and Roman, Alexander and Gleyzer, Sergei and Matchev, Konstantin and Fleming, George T. and H. HEPTAPOD: Orchestrating High Energy Physics Workflows Towards Autonomous Agency.
- [34] Moreno, Eric A. and Bright-Thonney, Samuel and Novak, Andrzej and Garcia, Dolores and Harris, Philip. AI Agents Can Already Autonomously Perform Experimental High Energy Physics.
- [35] Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems.
- [36] Zhu, Yuxuan and others. arXiv:2507.02825.
- [37]
- [38] Sirunyan, Albert M. and others. Search for new phenomena in final states with two opposite-charge, same-flavor leptons, jets, and missing transverse momentum in pp collisions at √s = 13 TeV. JHEP. 2018. doi:10.1007/s13130-018-7845-2. arXiv:1709.08908.
- [39] Sirunyan, Albert M. and others. Search for gauge-mediated supersymmetry in events with at least one photon and missing transverse momentum in pp collisions at √s = 13 TeV. Phys. Lett. B. 2018. doi:10.1016/j.physletb.2018.02.045. arXiv:1711.08008.
- [40] Sirunyan, Albert M. and others. Search for supersymmetry in events with at least one photon, missing transverse momentum, and large transverse event activity in proton-proton collisions at √s = 13 TeV. JHEP. 2017. doi:10.1007/JHEP12(2017)142. arXiv:1707.06193.
- [41] Sirunyan, Albert M. and others. Search for top squark pair production in pp collisions at √s = 13 TeV using single lepton events. JHEP. 2017. doi:10.1007/JHEP10(2017)019. arXiv:1706.04402.
- [42]
- [43]
- [44]