arxiv: 2606.23993 · v3 · pith:S3ELCLAInew · submitted 2026-06-22 · 💻 cs.LG · cs.AI · hep-ex

Learning to Trigger: Reinforcement Learning at the Large Hadron Collider

Pith reviewed 2026-06-30 10:18 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · hep-ex
keywords reinforcement learning · trigger systems · Large Hadron Collider · online control · anomaly detection · threshold tuning · CMS experiment · Monte Carlo simulation

The pith

A reinforcement learning agent adjusts LHC trigger thresholds in real time and transfers from simulation to real collision data without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames online tuning of trigger thresholds as a sequential decision problem in which an agent reads streaming summaries of background rates and signal features and adjusts thresholds to maximize signal efficiency while keeping the background rate inside a target tolerance band. It adapts Group-Filtered Policy Optimization with feasibility constraints and tests the resulting agents on two triggers: a total transverse energy trigger sensitive to pileup and an anomaly-detection trigger based on reconstruction loss. On Monte Carlo streams the agent raises the fraction of in-tolerance intervals by 48 percent for the energy trigger and 28 percent for the anomaly trigger, with up to 2 percent additional signal efficiency on those intervals. The identical agent, applied unchanged to real CMS Run 283408 data, produces 56 percent and 28 percent gains in in-tolerance time together with further efficiency improvements. A reader would care because current triggers are static and hand-tuned, so any automatic method that maintains bandwidth limits while capturing more signal events directly increases the physics output of the collider.
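The two headline quantities, the fraction of in-tolerance intervals and the signal efficiency on those intervals, can be made concrete with a small sketch. The [90, 110] kHz band around a 100 kHz target is taken from the Figure 3 caption; the per-chunk numbers, the function names, and the choice to pool efficiency over in-band chunks are illustrative assumptions, not the paper's evaluation code.

```python
def in_band_fraction(rates_khz, target_khz, tol_frac):
    """Return (fraction of chunks inside the tolerance band, per-chunk flags)."""
    lo = target_khz * (1.0 - tol_frac)
    hi = target_khz * (1.0 + tol_frac)
    flags = [lo <= r <= hi for r in rates_khz]
    return sum(flags) / len(flags), flags


def efficiency_in_band(passed, total, flags):
    """Signal efficiency pooled over in-band chunks only (assumed definition)."""
    p = sum(n for n, ok in zip(passed, flags) if ok)
    t = sum(n for n, ok in zip(total, flags) if ok)
    return p / t if t else float("nan")


# Hypothetical per-chunk background rates (kHz) and signal counts.
rates = [98.0, 112.5, 101.3, 89.0, 107.9]
frac, flags = in_band_fraction(rates, target_khz=100.0, tol_frac=0.10)
eff = efficiency_in_band([40, 55, 42, 30, 47], [100] * 5, flags)
print(frac, eff)
```

Restricting the efficiency to in-band chunks matters because a controller can always gain efficiency by lowering the threshold and overshooting the rate budget; the pooled figure only credits signal collected while the constraint holds.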

Core claim

By casting threshold tuning as a streaming control task and adapting Group-Filtered Policy Optimization with two feasibility-enforcing variants, the authors show that the learned agent increases the fraction of time background rates remain inside tolerance by 48 percent (HT) and 28 percent (AD) on Monte Carlo streams and by 56 percent (HT) and 28 percent (AD) on real CMS Run 283408 data, while also raising signal efficiency on the in-tolerance intervals. The authors present this as, to their knowledge, the first demonstration of reinforcement-learning trigger control on actual Large Hadron Collider collision data.

What carries the argument

Group-Filtered Policy Optimization (GFPO) agent adapted for streaming control that ingests recent rate and feature summaries to update trigger thresholds while enforcing background-rate feasibility.
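A minimal sketch of what a feasibility-first group filter might look like is given below. It assumes each sampled candidate carries a scalar reward and a resulting background rate, keeps K of the G candidates with in-band ones preferred, pads with the nearest out-of-band candidates when fewer than K are feasible, and standardizes rewards inside the kept set. The padding rule, the advantage normalization, and the name `filter_group` are assumptions for illustration; the paper's GFPO-F and GFPO-FR definitions may differ.

```python
from statistics import mean, pstdev


def filter_group(candidates, lo, hi, k):
    """Keep k sampled candidates, preferring those whose rate is in [lo, hi].

    candidates: list of (reward, background_rate) pairs, one per sampled action.
    Returns (indices of the kept set, advantage per candidate).
    """
    def band_gap(rate):
        # Distance to the tolerance band; zero when the rate is inside it.
        return max(lo - rate, 0.0, rate - hi)

    # In-band candidates first (highest reward first), then nearest out-of-band.
    order = sorted(
        range(len(candidates)),
        key=lambda i: (band_gap(candidates[i][1]), -candidates[i][0]),
    )
    kept = order[:k]
    rewards = [candidates[i][0] for i in kept]
    mu, sigma = mean(rewards), pstdev(rewards)
    adv = [0.0] * len(candidates)  # filtered-out candidates get no gradient signal
    for i in kept:
        adv[i] = (candidates[i][0] - mu) / (sigma + 1e-8)
    return kept, adv


# Hypothetical group of G = 5 candidates with a [90, 110] kHz band and K = 3.
cands = [(0.50, 100.0), (0.70, 120.0), (0.60, 95.0), (0.40, 111.0), (0.55, 105.0)]
kept, adv = filter_group(cands, lo=90.0, hi=110.0, k=3)
print(kept, adv)
```

In this toy group the highest-reward candidate (reward 0.70 at 120 kHz) is out of band and is dropped, which is the behavior a feasibility constraint is meant to produce: the policy is not reinforced toward actions that buy efficiency by breaking the rate budget.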

If this is right

  • Signal efficiency rises by up to 2 percent on the intervals where background rate stays inside tolerance.
  • The same policy transfers directly to real data without any fine-tuning step.
  • The approach works for both a pileup-sensitive total-energy trigger and an anomaly-detection trigger.
  • Feasibility constraints introduced in the GFPO-F and GFPO-FR variants keep the background rate inside the allowed band during training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Continuous online adaptation could reduce the frequency of manual retuning campaigns during long LHC running periods.
  • Extending the same streaming formulation to additional trigger types could produce coordinated optimization across an entire trigger menu.
  • If the observed sim-to-real transfer holds across multiple runs and years, the method may tolerate the gradual changes in detector response and beam conditions that occur in practice.
  • Analogous streaming reinforcement-learning controllers could be examined in other high-rate scientific instruments that must filter data under strict bandwidth and latency limits.

Load-bearing premise

Monte Carlo streams used for training accurately capture the statistical properties and drift patterns of the real collision data the agent will meet at runtime.
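One way to probe this premise is to compare a one-dimensional summary (for example pileup or the trigger score) between the simulated and real streams with a distribution distance. The sketch below computes the 1-D Wasserstein-1 distance as the integral of the absolute difference between two empirical CDFs. The sample values and the function name are hypothetical, and the choice of metric is an assumption rather than something the paper reports.

```python
from bisect import bisect_right


def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two 1-D samples via their empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    grid = sorted(set(xs) | set(ys))
    total = 0.0
    for a, b in zip(grid, grid[1:]):
        fx = bisect_right(xs, a) / len(xs)  # empirical CDF of xs on [a, b)
        fy = bisect_right(ys, a) / len(ys)  # empirical CDF of ys on [a, b)
        total += abs(fx - fy) * (b - a)
    return total


# Hypothetical numbers of primary vertices in a simulated and a real window.
mc_npv = [28, 30, 31, 33, 35]
data_npv = [30, 32, 33, 35, 37]
print(wasserstein_1d(mc_npv, data_npv))
```

A distance that stays small relative to the spread of the quantity across the run would support the premise; a distance that grows over time would point to drift the simulation does not capture.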

What would settle it

Applying the trained agent, unchanged, to a new real collision run and finding that its in-tolerance time fraction falls to or below the level achieved by the static baselines would undercut the transfer claim.

Figures

Figures reproduced from arXiv: 2606.23993 by Abhijith Gandrakota, Cecilia Tosciri, Christian Herwig, David W. Miller, Giovanna Salvi, Jennifer Ngadiuba, Nhan Tran, Shaghayegh Emami, Yuxin Chen, Zixin Ding.

Figure 1: Sensitivity Analysis of Reward Components for HT trigger (20% MC). Each point represents a (λ1, λ2) configuration from Equation 2, with concave hulls connecting the upper envelope per method. The x-axis measures the fraction of chunks whose background rate falls within the tolerance band, and the y-axis measures overall signal efficiency. Our methods (GFPO-F and GFPO-FR) collapse to a tighter cluster in th…
Figure 2: GRPO's group-feasibility failure on HT (latter 20% of MC, G = 16) averaged across 3 seeds. (a) Candidate background rates; green/red denote feasible/infeasible w.r.t. the tolerance band. (b) Per-step feasible fraction f_t = n_feas/G (run mean ⟨f_t⟩ = 0.58). f_t = 0 on 30.8% of steps, so the group has no in-band sample and GRPO reinforces the least-infeasible (still out-of-band) action. AD trigger is in…
Figure 3: Background rates for HT (evaluated on 20% MC). Our methods (GFPO-F and GFPO-FR) concentrate the background rate inside the [90, 110] kHz tolerance band around the 100 kHz target, while the baselines spread well outside it. GFPO-F sits on target (µ = 100.0 kHz) with the tighter distribution (σ = 2.0 kHz). GFPO-FR runs at a slightly higher mean (µ = 106.1 kHz) with a marginally wider spread (σ = 3.4 kHz). Di…
Figure 4: Average step composition for HT (K = 16, G = 64 for GFPO, G = K = 16 for GRPO). Each step is classified by |F_t| relative to K: Pure feasible (|F_t| ≥ K, kept set K_t ⊆ F_t), Padded (0 < |F_t| < K, kept set mixes feasible and out-of-band candidates), Zero feasibility (F_t = ∅). AD trigger is in…
Figure 5: Anomaly detection results on UNSW-NB15 (left) and NAB (right).
Figure 6: Background scores drift over time for both triggers for MC. Running mean (solid), median…
Figure 7: Background scores drift over time for both triggers for CMS Run 283408. Running mean…
Figure 8: Signal-background discrimination for the…
Figure 9: Signal-background discrimination for the…
Figure 10: (a) GFPO-F for AD trigger (CMS Run 283408) (b) GFPO-FR for AD trigger (CMS Run…
Figure 11: Sensitivity Analysis of Reward Components (MC). Each point represents a (λ1, λ2) configuration from Equation 2, with concave hulls connecting the upper envelope per method. The x-axis measures the fraction of chunks whose background rate falls within the tolerance band, and the y-axis measures overall signal efficiency. Across all four trigger-signal combinations (HT/AD × tt̄/h → 4b), baseline methods (…
Figure 12: DSPOT calibration-deployment mismatch: the GPD is fitted on 50 calibration chunks…
Figure 13: Zero-feasible step fraction for GRPO and L-GRPO (AD trigger). (a) Mean zero-feasible fraction across all chunks. On MC data, approximately 20% of micro-steps yield no rate-feasible candidate from the sampled group of G actions for both GRPO and L-GRPO. On CMS real data this rises to ∼61% for both methods. (b) Per-chunk trajectory over time. The two curves are nearly indistinguishable in both datasets, con…
Figure 14: Dual variable λ_t and in-band rate over chunks for L-GRPO (AD trigger). (a) On MC data, λ_t decreases monotonically from its initial value of 0.25 as the policy stays in-band (∼96% of chunks). (b) On CMS real collision data, λ_t fluctuates but remains roughly constant (≈ 0.25) while the in-band rate oscillates around 54%. Shaded regions mark chunks where the in-band rate is below 50%. Despite λ_t responding t…
Figure 15: Background rate trajectory on CMS real data (AD trigger, MC-trained frozen policy). Grey band: ±τ tolerance around the target r*_B. GFPO-F and GFPO-FR track within the band for most of the 73-chunk deployment window. GRPO and L-GRPO sit near or below the lower band edge throughout, despite L-GRPO's dual variable adapting concurrently (…
Figure 16: Zero-feasibility fraction of GRPO vs. group size G. Fraction of training steps on which every rollout violates the rate budget, for AD and HT triggers. Even at G = 256, 24–34% of steps yield no feasible sample, so the group-relative baseline collapses to an uninformative signal on a constant fraction of updates. Scaling G alone cannot recover constraint satisfaction, motivating the constraint-aware varian…
Figure 17: GRPO group-feasibility failure (AD trigger, MC, G = 16); appendix complement to…
Figure 18: Step composition for the AD trigger; Complement to…
Figure 19: Look-ahead rates (p1, p2) and survival ratios (r1, r2) over time, evaluated at each window's natural operating point θ = P99.75 so that p0 ≈ r*_B by construction. The four scalars give the agent a discrete forward-rate map of the tail at θ + ∆ and θ + 2∆ (with ∆ = 1 GeV on HT and ∆ = 0.5 on AD), complementing the local sensitivity probe ∂r/∂θ. Their drift over the run motivates including (p1, p2, r1, r2)…
Figure 20: Non-stationarity of trigger thresholds under pileup drift (MC).
Figure 21: Npv (Number of Primary Vertices) distribution over time (MC).
Figure 22: Exponential Moving Average (EMA) of the fractional rate error at different smoothing…
Figure 23: Sensitivity probe over time (MC)
Figure 24: Near-threshold occupancy over time (MC).
Figure 25: Background trigger rates under DQN for (a)…
Figure 26: Background rates distribution for (a) HT trigger, (b) AD trigger. Our methods (GFPO-F and GFPO-FR) have the highest in-band rate of all established baselines. GFPO-FR (106.1 kHz) operates nearer the band edge for signal compared to GFPO (100.0 kHz). Signal efficiency over time on MC. Emami et al. [5] showcases that adaptive thresholding recovers signal efficiency as a run progresses, while a static menu s…
Figure 27: Background rates under Constant Menu, PID loop, GRPO, GFPO-F and GFPO-FR (MC)
Figure 28: Signal efficiency over time under Constant Menu, PID loop, GRPO, GFPO-F and GFPO…
Figure 29: Background trigger rates under MC→CMS transfer (…
Figure 30: Background trigger rates under MC→CMS transfer with test-time training (Table 12). Per-step rates on CMS Run 283408 with policies trained on MC and updated online during the run on real collision data (one gradient step per chunk), for the (a) HT trigger and (b) AD trigger. G. Ablation study on noisy anomaly scores. In HEP anomaly detection, a long line of work has explicitly highlighted the sensitivity of…
Figure 31: UNSW-NB15 TPR by attack difficulty and adaptation regime. For each controller we report TPR on two attack classes: Exploits (frequent, high-signal; easy) and Backdoors (rare, low-signal; hard). Bars span {Deployment, Test-time training} × {Exploits, Backdoors}: solid = frozen policy, hatched = online adaptation; high opacity = Exploits, low opacity = Backdoors. GFPO-F and GFPO-FR (ours) maintain the highe…
Figure 32: NAB precision and recall per method (mean over 24 test files). Solid bars: precision; hatched bars: recall. Methods sorted left-to-right by ascending F1. All controllers operate on the same per-chunk anomaly scores under the same FAR budget. GFPO-F and GFPO-FR (ours) dominate the Pareto frontier: precision ≈ 28% and recall ≈ 42%. Classical controllers (Constant, PID, DSPOT) reach 11–20% precision but reca…
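
The captions of Figures 2 and 4 define two per-step diagnostics: the feasible fraction (in-band candidates divided by group size G) and a three-way step classification by the number of feasible candidates relative to the kept-set size K. The sketch below implements those two definitions as stated in the captions; the function name and the example rates are illustrative.

```python
def classify_step(rates, lo, hi, k):
    """Return (feasible fraction, step label) for one group of candidate rates."""
    n_feas = sum(lo <= r <= hi for r in rates)
    f_t = n_feas / len(rates)
    if n_feas == 0:
        label = "zero feasibility"  # no in-band candidate in the group
    elif n_feas < k:
        label = "padded"            # kept set must include out-of-band candidates
    else:
        label = "pure feasible"     # kept set can be drawn entirely from in-band ones
    return f_t, label


# Hypothetical groups of G = 4 candidate rates (kHz) with K = 2.
steps = [
    [95.0, 130.0, 101.0, 88.0],
    [130.0, 85.0, 101.0, 120.0],
    [130.0, 85.0, 125.0, 70.0],
]
results = [classify_step(s, lo=90.0, hi=110.0, k=2) for s in steps]
zero_share = sum(label == "zero feasibility" for _, label in results) / len(results)
print(results, zero_share)
```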
Original abstract

High-throughput scientific facilities such as the Large Hadron Collider depend on real-time event filtering (triggering) under tight constraints on bandwidth, latency, and storage. In practice, trigger menus are largely static and hand-tuned and can become suboptimal as detector conditions, pileup, and background composition drift over time. We cast online threshold tuning as a sequential decision-making problem: a reinforcement learning agent ingests streaming summaries of recent rates and signal-sensitive features and updates trigger thresholds to maximize signal efficiency while tracking a target background rate within a tolerance band. We adapt Group-Filtered Policy Optimization (GFPO) to streaming control and introduce two variants (GFPO-F, GFPO-FR) that enforce background rate feasibility during training. On a benchmark that emulates realistic collider operation, we study two representative triggers: a total transverse energy ($H_T$) trigger sensitive to pileup variation, and an anomaly-detection (AD) trigger based on reconstruction loss for rare or non-standard signatures. On Monte Carlo streams, our agent increases the fraction of in-tolerance time intervals by 48% ($H_T$) and 28% (AD), with a cumulative gain of up to 2% in signal efficiency on those in-tolerance intervals. Transferring from simulation to real collision data (CMS Run 283408), the same agent, without fine-tuning, achieves a 56% ($H_T$) and 28% (AD) in-tolerance improvement over baselines, with further signal-efficiency gain on both triggers. To our knowledge, this is the first demonstration of RL-based trigger control on real Large Hadron Collider collision data. Code is available at https://github.com/Zixind/GFPO_LHC (see repo for details).
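
The link between a threshold and the target background rate can be illustrated with a short worked example. The Figure 19 caption places each window's natural operating point at the 99.75th percentile, and the Figure 3 caption quotes a 100 kHz target. The sketch below assumes a 40 MHz input rate (the nominal LHC bunch-crossing rate, not stated in this summary), under which a 0.25% pass fraction corresponds to 100 kHz. Function names and the synthetic scores are hypothetical.

```python
def tail_threshold(scores, tail_fraction):
    """Threshold that lets roughly tail_fraction of the events pass (score > threshold)."""
    ordered = sorted(scores)
    n = len(ordered)
    n_pass = min(max(round(tail_fraction * n), 0), n - 1)
    return ordered[n - n_pass - 1]


def accepted_rate_khz(scores, threshold, input_rate_khz):
    """Accepted rate = pass fraction times the input event rate."""
    passed = sum(s > threshold for s in scores)
    return passed / len(scores) * input_rate_khz


# Synthetic window of 4000 distinct background scores (ties would shift the count).
scores = [float(i) for i in range(1, 4001)]
theta = tail_threshold(scores, tail_fraction=0.0025)  # the P99.75 operating point
rate = accepted_rate_khz(scores, theta, input_rate_khz=40_000.0)  # assumed 40 MHz input
print(theta, rate)
```

A static menu fixes the threshold once; when the score distribution drifts, the same threshold passes a different fraction of events, which is why the rate leaves the band unless the threshold is re-derived or adapted.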

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper casts LHC trigger threshold tuning as a streaming RL control problem and adapts Group-Filtered Policy Optimization (GFPO) with feasibility constraints (GFPO-F, GFPO-FR). On Monte Carlo streams the agent raises the fraction of in-tolerance intervals by 48% (HT) and 28% (AD) while adding up to 2% signal efficiency. The same policy, without fine-tuning, is reported to deliver 56% (HT) and 28% (AD) in-tolerance gains on real CMS Run 283408 collision data, presented as the first RL-based trigger demonstration on actual LHC data. Code is released.

Significance. If the sim-to-real transfer result is robust, the work would show that RL can maintain trigger performance under realistic drift without manual retuning, a practical advance for high-throughput scientific facilities. The open-source release and the explicit zero-shot transfer experiment are concrete strengths that aid reproducibility and allow independent verification.

major comments (2)
  1. [Abstract] Abstract (transfer paragraph): the central claim of 56% (HT) and 28% (AD) in-tolerance improvement on CMS Run 283408 without fine-tuning rests on the unverified assumption that Monte Carlo training streams match the statistical properties of the real data. No distributional alignment metrics (rate histograms, pileup spectra, or feature-space distances) are supplied to support this match.
  2. [Abstract] Abstract: the reported performance numbers are given without error bars, confidence intervals, or any statistical test against the baselines. This omission directly affects the ability to judge whether the claimed gains are distinguishable from run-to-run variability.
minor comments (1)
  1. [Abstract] The abstract introduces GFPO-F and GFPO-FR but does not indicate where in the manuscript the precise feasibility constraints or the group-filtering mechanism are defined.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive comments. We address each major point below and commit to revisions that strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract (transfer paragraph): the central claim of 56% (HT) and 28% (AD) in-tolerance improvement on CMS Run 283408 without fine-tuning rests on the unverified assumption that Monte Carlo training streams match the statistical properties of the real data. No distributional alignment metrics (rate histograms, pileup spectra, or feature-space distances) are supplied to support this match.

    Authors: We agree that explicit distributional alignment metrics would strengthen the sim-to-real transfer claim. Although the Monte Carlo samples were generated with standard CMS tuning procedures to reproduce observed pileup and rate distributions, the current manuscript does not include side-by-side histograms or distance measures. In the revision we will add rate histograms, pileup spectra, and feature-space distances (e.g., Wasserstein or MMD) between the training streams and Run 283408, together with a short discussion of residual mismatches and their expected impact on policy transfer. revision: yes

  2. Referee: [Abstract] Abstract: the reported performance numbers are given without error bars, confidence intervals, or any statistical test against the baselines. This omission directly affects the ability to judge whether the claimed gains are distinguishable from run-to-run variability.

    Authors: The referee correctly notes the absence of uncertainty quantification. The reported percentages are point estimates obtained from single long streams. In the revised manuscript we will recompute all in-tolerance and efficiency figures with bootstrap confidence intervals derived from multiple independent rollouts (both in simulation and on the real run) and will include paired statistical tests against the baseline controllers to assess whether the observed gains exceed run-to-run variability. revision: yes
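
The uncertainty quantification promised in the second response could take the form of a paired percentile bootstrap over rollouts. The sketch below assumes one in-tolerance-fraction difference (agent minus baseline) per independent rollout; the numbers are hypothetical placeholders, and resampling rollouts as independent units is an assumption that ignores any correlation between them.

```python
import random


def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of paired differences."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Hypothetical per-rollout gains in in-tolerance fraction (agent minus baseline).
diffs = [0.30, 0.25, 0.35, 0.28, 0.32, 0.27]
lo, hi = bootstrap_ci(diffs)
print(lo, hi)
```

If the resulting interval excludes zero, the gain is distinguishable from rollout-to-rollout variability at the chosen level; with only a single long stream per condition, as the response describes for the current manuscript, no such interval can be formed.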

Circularity Check

0 steps flagged

No circularity: empirical RL results on held-out MC and real data streams

Full rationale

The paper reports an RL agent (GFPO variants) trained on Monte Carlo streams and evaluated zero-shot on held-out MC and real CMS Run 283408 data. Performance metrics (in-tolerance fraction, signal efficiency) are direct empirical comparisons against baselines. No equations, fitted parameters, or self-citation chains are shown that reduce the reported gains to inputs defined inside the paper. The derivation chain consists of standard RL adaptation plus experimental benchmarks; results remain falsifiable against external data streams.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that streaming rate summaries are informative enough for policy learning and that the GFPO variants enforce feasibility without introducing new fitted constants beyond standard RL hyperparameters.

free parameters (1)
  • GFPO learning-rate and feasibility parameters
    Standard RL hyperparameters and the two feasibility variants (GFPO-F, GFPO-FR) are introduced but their specific values are not enumerated in the abstract.
axioms (1)
  • domain assumption: Streaming summaries of recent rates and signal-sensitive features contain sufficient information to decide threshold updates that track a target background rate.
    The agent architecture is defined to ingest exactly these summaries.
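
As an illustration of what one such streaming summary might look like, the Figure 22 caption mentions an exponential moving average (EMA) of the fractional rate error. The sketch below assumes the error is defined as (rate − target) / target and that the EMA starts at zero; both choices, and the function name, are assumptions rather than the paper's definitions.

```python
def ema_fractional_error(rates, target, alpha):
    """EMA of (rate - target) / target, one value per chunk."""
    ema, out = 0.0, []
    for r in rates:
        err = (r - target) / target
        ema = alpha * err + (1.0 - alpha) * ema  # larger alpha reacts faster
        out.append(ema)
    return out


# Hypothetical chunk rates (kHz) drifting above a 100 kHz target.
trace = ema_fractional_error([110.0, 110.0, 120.0], target=100.0, alpha=0.5)
print(trace)
```

A small smoothing factor gives a stable but slow summary, while a large one tracks drift quickly at the cost of noise, so feeding the agent the EMA at several smoothing factors is one plausible way to expose both time scales.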

pith-pipeline@v0.9.1-grok · 5901 in / 1212 out tokens · 32607 ms · 2026-06-30T10:18:13.039251+00:00 · methodology

Reference graph

Works this paper leans on

83 extracted references · 14 canonical work pages · 5 internal anchors

  1. [1]

    Summary of the trigger systems of the large hadron collider experiments alice, atlas, cms and lhcb.Journal of Physics G: Nuclear and Particle Physics, 52(3):030501, 2025

    Johannes Albrecht, Leon Bozianu, Lukas Calefice, Sofia Cella, CE Cocha Toapaxi, Caterina Doglioni, VV Gligorov, James Andrew Gooding, Kaare Endrup Iversen, Patin Inkaew, et al. Summary of the trigger systems of the large hadron collider experiments alice, atlas, cms and lhcb.Journal of Physics G: Nuclear and Particle Physics, 52(3):030501, 2025

  2. [2]

    Towards an inter- pretable data-driven trigger system for high-throughput physics facilities.arXiv preprint arXiv:2104.06622, 2021

    Chinmaya Mahesh, Kristin Dona, David W Miller, and Yuxin Chen. Towards an inter- pretable data-driven trigger system for high-throughput physics facilities.arXiv preprint arXiv:2104.06622, 2021

  3. [3]

    The cms high level trigger.The European Physical Journal C-Particles and Fields, 46(3):605–667, 2006

    CMS collaboration. The cms high level trigger.The European Physical Journal C-Particles and Fields, 46(3):605–667, 2006

  4. [4]

    Performance of the atlas level-1 topological trigger in run 2.The European Physical Journal C, 82(1):7, 2022

    Georges Aad, Brad Abbott, Dale Charles Abbott, A Abed Abud, Kira Abeling, Deshan Kavishka Abhayasinghe, Syed Haider Abidi, OS AbouZeid, NL Abraham, Halina Abramowicz, et al. Performance of the atlas level-1 topological trigger in run 2.The European Physical Journal C, 82(1):7, 2022

  5. [5]

    Miller, Jennifer Ngadiuba, and Nhan Tran

    Shaghayegh Emami, Cecilia Tosciri, Giovanna Salvi, Zixin Ding, Yuxin Chen, Abhijith Gan- drakota, Christian Herwig, David W. Miller, Jennifer Ngadiuba, and Nhan Tran. Towards a self-driving trigger at the LHC: Adaptive response in real time.Machine Learning: Science and Technology, 2026. doi: 10.1088/2632-2153/ae631f. URL https://iopscience.iop.org/ artic...

  6. [6]

    An automated bandwidth division for the lhcb upgrade trigger.Computing and Software for Big Science, 9(1):7, 2025

    Timothy Evans, Conor Fitzpatrick, and Joshua Horswill. An automated bandwidth division for the lhcb upgrade trigger.Computing and Software for Big Science, 9(1):7, 2025

  7. [7]

    Description and performance of track and primary-vertex reconstruc- tion with the cms tracker.Journal of Instrumentation, 9(10):P10009–P10009, 2014

    CMS collaboration et al. Description and performance of track and primary-vertex reconstruc- tion with the cms tracker.Journal of Instrumentation, 9(10):P10009–P10009, 2014

  8. [8]

    Sutton and Andrew G

    Richard S. Sutton and Andrew G. Barto.Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998

  9. [9]

    Deep recurrent q-learning for partially observable mdps

    Matthew J Hausknecht and Peter Stone. Deep recurrent q-learning for partially observable mdps. InAAAI fall symposia, volume 45, page 141, 2015

  10. [10]

    Dream to control: Learning behaviors by latent imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. InInternational Conference on Learning Representa- tions, 2020. URLhttps://openreview.net/forum?id=S1lOTC4tDS

  11. [11]

    Thickbrick: optimal event selection and categorization in high energy physics

    Konstantin T Matchev and Prasanth Shyamsundar. Thickbrick: optimal event selection and categorization in high energy physics. part i. signal discovery.Journal of High Energy Physics, 2021(3):291, 2021

  12. [12]

    Sequential anomaly detection using inverse reinforcement learning

    Min-hwan Oh and Garud Iyengar. Sequential anomaly detection using inverse reinforcement learning. InProceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & data mining, pages 1480–1490, 2019

  13. [13]

    Adt: Time series anomaly detection for cyber-physical systems via deep reinforcement learning.Computers & Security, 141:103825, 2024

    Xue Yang, Enda Howley, and Michael Schukat. Adt: Time series anomaly detection for cyber-physical systems via deep reinforcement learning.Computers & Security, 141:103825, 2024

  14. [14]

    Variational autoencoders for new physics mining at the large hadron collider.Journal of High Energy Physics, 2019(5):1–29, 2019

    Olmo Cerri, Thong Q Nguyen, Maurizio Pierini, Maria Spiropulu, and Jean-Roch Vlimant. Variational autoencoders for new physics mining at the large hadron collider.Journal of High Energy Physics, 2019(5):1–29, 2019

  15. [15]

    The atlas run-3 trigger menu

    Sofia Cella. The atlas run-3 trigger menu. Technical report, ATL-COM-DAQ-2024-077, 2024

  16. [16]

    About CMS Open Data, 2024

    CMS Collaboration. About CMS Open Data, 2024. URL https://opendata.cern.ch/ docs/about-cms. Accessed: March 9, 2025. 13

  17. [17]

    Sample more to think less: Group filtered policy optimization for concise reasoning

    Vaishnavi Shrivastava, Ahmed Hassan Awadallah, Vidhisha Balachandran, Shivam Garg, Harki- rat Behl, and Dimitris Papailiopoulos. Sample more to think less: Group filtered policy optimization for concise reasoning. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=UKOqoULbZS

  18. [18]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  19. [19]

    Alphaflow: autonomous discovery and optimization of multi-step chemistry using a self-driven fluidic lab guided by reinforcement learning.Nature Communications, 14(1):1403, 2023

    Amanda A V olk, Robert W Epps, Daniel T Yonemoto, Benjamin S Masters, Felix N Castellano, Kristofer G Reyes, and Milad Abolhasani. Alphaflow: autonomous discovery and optimization of multi-step chemistry using a self-driven fluidic lab guided by reinforcement learning.Nature Communications, 14(1):1403, 2023

  20. [20]

    Reinforcement learning-trained optimisers and bayesian optimisation for online particle accelerator tuning

    Jan Kaiser, Chenran Xu, Annika Eichler, Andrea Santamaria Garcia, Oliver Stein, Erik Brün- dermann, Willi Kuropka, Hannes Dinter, Frank Mayet, Thomas Vinatier, et al. Reinforcement learning-trained optimisers and bayesian optimisation for online particle accelerator tuning. Scientific reports, 14(1):15733, 2024

  21. [21]

    Magnetic control of tokamak plasmas through deep reinforcement learning.Nature, 602(7897): 414–419, 2022

    Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Magnetic control of tokamak plasmas through deep reinforcement learning.Nature, 602(7897): 414–419, 2022

  22. [22]

    Meta-aad: Active anomaly detection with deep reinforcement learning

    Daochen Zha, Kwei-Herng Lai, Mingyang Wan, and Xia Hu. Meta-aad: Active anomaly detection with deep reinforcement learning. In2020 IEEE International Conference on Data Mining (ICDM), pages 771–780. IEEE, 2020

  23. [23]

    Anomaly detection in streams with extreme value theory

    Alban Siffer, Pierre-Alain Fouque, Alexandre Termier, and Christine Largouet. Anomaly detection in streams with extreme value theory. InProceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1067–1075, 2017

  24. [24]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  25. [25]

    Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=4OsgYD7em5

  26. [26]

    Ekaterina Govorkova, Ema Puljak, Thea Aarrestad, Thomas James, Vladimir Loncar, Maurizio Pierini, Adrian Alan Pol, Nicolo Ghielmetti, Maksymilian Graczyk, Sioni Summers, et al. Autoencoders on field-programmable gate arrays for real-time, unsupervised new physics detection at 40 mhz at the large hadron collider.Nature Machine Intelligence, 4(2):154–161, 2022

  27. [27]

    A primer on reinforcement learning in medicine for clinicians.NPJ digital medicine, 7(1):337, 2024

    Pushkala Jayaraman, Jacob Desman, Moein Sabounchi, Girish N Nadkarni, and Ankit Sakhuja. A primer on reinforcement learning in medicine for clinicians.NPJ digital medicine, 7(1):337, 2024

  28. [28]

    Smooth imitation learning for online sequence prediction

    Hoang Le, Andrew Kang, Yisong Yue, and Peter Carr. Smooth imitation learning for online sequence prediction. InInternational Conference on Machine Learning, pages 680–688. PMLR, 2016

  29. [29]

    Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

    V olodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

  30. [30]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  31. [31]

    Constrained policy optimization

    Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. InInternational conference on machine learning, pages 22–31. Pmlr, 2017. 14

  32. [32]

    Morad Aaboud, Georges Aad, Brad Abbott, Jalal Abdallah, Baptiste Abeloos, Rosemarie Aben, OS AbouZeid, NL Abraham, Halina Abramowicz, Henso Abreu, et al. Performance of the atlas trigger system in 2015. The European Physical Journal C, 77(5):1–53, 2017

  33. [33]

    Atlas Collaboration et al. Operation of the atlas trigger system in run 2. Journal of Instrumentation, 15(10):P10004–P10004, 2020

  34. [34]

    ATLAS Collaboration. Technical Design Report for the Phase-II Upgrade of the ATLAS Tile Calorimeter. Technical report, CERN, Geneva, 2017. URL https://cds.cern.ch/record/2285583

  35. [35]

    Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. Safe reinforcement learning via shielding. In Proceedings of the AAAI conference on artificial intelligence, 2018

  36. [36]

    Murad Dawood, Ahmed Shokry, and Maren Bennewitz. A dynamic safety shield for safe and efficient reinforcement learning of navigation tasks. In 7th Annual Learning for Dynamics & Control Conference, pages 686–697. PMLR, 2025

  37. [37]

    Marcin Andrychowicz, Anton Raichuk, Piotr Stańczyk, Manu Orsini, Sertan Girgin, Raphael Marinier, Léonard Hussenot, Matthieu Geist, Olivier Pietquin, Marcin Michalski, et al. What matters in on-policy reinforcement learning? a large-scale empirical study. arXiv preprint arXiv:2006.05990, 2020

  38. [38]

    Junbo Li, Peng Zhou, Rui Meng, Meet P Vadera, Lihong Li, and Yang Li. Turn-ppo: Turn-level advantage estimation with ppo for improved multi-turn rl in agentic llms. In Findings of the Association for Computational Linguistics: EACL 2026, pages 6227–6243, 2026

  39. [39]

    Vijay Konda and John Tsitsiklis. Actor-critic algorithms. Advances in neural information processing systems, 12, 1999

  40. [40]

    John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2016

  41. [41]

    Fabio Pardo, Arash Tavakoli, Vitaly Levdik, and Petar Kormushev. Time limits in reinforcement learning. In International Conference on Machine Learning, pages 4045–4054. PMLR, 2018

  42. [42]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  43. [43]

    Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 international conference on autonomous agents and multiagent systems, pages 1181–1189, 2015

  44. [44]

    Eitan Altman. Constrained Markov decision processes. Routledge, 2021

  45. [45]

    Chen Tessler, Daniel J. Mankowitz, and Shie Mannor. Reward constrained policy optimization. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SkfrvsA9FX

  46. [46]

    Adam Stooke, Joshua Achiam, and Pieter Abbeel. Responsive safety in reinforcement learning by pid lagrangian methods. In International conference on machine learning, pages 9133–9143. PMLR, 2020

  47. [47]

    John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897. PMLR, 2015

  48. [48]

    Maria Arsuaga-Rios, Vladimír Bahyl, Manuel Batalha, Cédric Caffy, Eric Cano, Niccolo Capitoni, Cristian Contescu, Michael Davis, David Fernandez Alvarez, Jaroslav Guenther, et al. Lhc data storage: Preparing for the challenges of run-3. In EPJ Web of Conferences, volume 251, page 02023. EDP Sciences, 2021

  49. [49]

    Nour Moustafa and Jill Slay. Unsw-nb15: a comprehensive data set for network intrusion detection systems (unsw-nb15 network data set). In 2015 Military Communications and Information Systems Conference (MilCIS), pages 1–6. IEEE, 2015

  50. [50]

    Alexander Lavin and Subutai Ahmad. Evaluating real-time anomaly detection algorithms–the numenta anomaly benchmark. In 2015 IEEE 14th international conference on machine learning and applications (ICMLA), pages 38–44. IEEE, 2015

  51. [51]

    Georges Aad, Erlend Aakvaag, B Abbott, Kira Abeling, Nils Julius Abicht, SH Abidi, Asmaa Aboulhorma, Halina Abramowicz, Henso Abreu, Yiming Abulaiti, et al. The atlas trigger system for lhc run 3 and trigger performance in 2022. Journal of Instrumentation, 19(06):P06029, 2024

  52. [52]

    Roel Aaij, Marta Adinolfi, Salvatore Aiola, S Akar, Johannes Albrecht, M Alexander, S Amato, Yasmine Amhis, F Archilli, M Bala, et al. A comparison of cpu and gpu implementations for the lhcb experiment run 3 trigger. Computing and Software for Big Science, 6(1):1, 2022

  53. [53]

    Laura Boggia, Carlos Cocha, Fotis Giasemis, Joachim Hansen, Patin Inkaew, Kaare Endrup Iversen, Pratik Jawahar, Henrique Pineiro Monteagudo, Micol Olocco, Sten Astrand, et al. Review of machine learning for real-time analysis at the large hadron collider experiments alice, atlas, cms and lhcb. arXiv preprint arXiv:2506.14578, 2025

  54. [54]

    Isaac D Lutz, Shunzhi Wang, Christoffer Norn, Alexis Courbet, Andrew J Borst, Yan Ting Zhao, Annie Dosey, Longxing Cao, Jinwei Xu, Elizabeth M Leaf, et al. Top-down design of protein architectures with reinforcement learning. Science, 380(6642):266–273, 2023

  55. [55]

    Rajat Khanda, Mohammad Baqar, Sambuddha Chakrabarti, and Satyasaran Changdar. Extending group relative policy optimization to continuous control: A theoretical framework for robotic reinforcement learning. arXiv preprint arXiv:2507.19555, 2025

  56. [56]

    Jiehui Xu, Haixu Wu, Jianmin Wang, and Mingsheng Long. Anomaly transformer: Time series anomaly detection with association discrepancy. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=LzQQ89U1qm_

  57. [57]

    Lyndon Evans and Philip Bryant. Lhc machine. Journal of instrumentation, 3(08):S08001–S08001, 2008

  58. [58]

    Xiaowei Jia, Jared Willard, Anuj Karpatne, Jordan Read, Jacob Zwart, Michael Steinbach, and Vipin Kumar. Physics guided rnns for modeling dynamical systems: A case study in simulating lake temperature profiles. In Proceedings of the 2019 SIAM international conference on data mining, pages 558–566. SIAM, 2019

  59. [59]

    Arka Daw, Anuj Karpatne, William D Watkins, Jordan S Read, and Vipin Kumar. Physics-guided neural networks (pgnn): An application in lake temperature modeling. In Knowledge guided machine learning, pages 353–372. Chapman and Hall/CRC, 2022

  60. [60]

    Yingheng Tang, Jichao Fan, Xinwei Li, Jianzhu Ma, Minghao Qi, Cunxi Yu, and Weilu Gao. Physics-informed recurrent neural network for time dynamics in optical resonances. Nature computational science, 2(3):169–178, 2022

  61. [61]

    Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine learning, 49(2):209–232, 2002

  62. [62]

    Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008

  63. [63]

    Jincheng Mei, Wesley Chung, Valentin Thomas, Bo Dai, Csaba Szepesvari, and Dale Schuurmans. The role of baselines in policy gradient optimization. Advances in Neural Information Processing Systems, 35:17818–17830, 2022

  64. [64]

    Hongyi Zhou, Kai Ye, Erhan Xu, Jin Zhu, Ying Yang, Shijin Gong, and Chengchun Shi. Demystifying group relative policy optimization: Its policy gradient is a u-statistic. arXiv preprint arXiv:2603.01162, 2026

  65. [65]

    CMS Collaboration. Simulated dataset tttohadronic_tunecp5_13tev-powheg-pythia8 in miniaodsim format for 2016 collision data. CERN Open Data Portal, 2024. URL https://opendata.cern.ch/record/67840. Data recorded in 2016 and published in 2024

  66. [66]

    CERN Open Data Portal. About cms. https://opendata.cern.ch/docs/about-cms, Accessed: 2026-01-03

  68. [68]

    Daniel de Florian, Christophe Grojean, Fabio Maltoni, C Mariotti, A Nikitenko, M Pieri, P Savard, M Schumacher, R Aggleton, M Ahmad, et al. CERN Yellow Reports: Monographs, Vol 2 (2017): Handbook of LHC Higgs cross sections: 4. Deciphering the nature of the Higgs sector, volume 2. CERN, 2017

  69. [69]

    Morad Aaboud, G Aad, B Abbott, B Abeloos, DK Abhayasinghe, SH Abidi, OS AbouZeid, NL Abraham, H Abramowicz, H Abreu, et al. Search for the higgs boson produced in association with a vector boson and decaying into two spin-zero particles in the h→aa→4b channel in pp collisions at √s = 13 tev with the atlas detector. Journal of High Energy Physics, 2018(10): ...

  70. [70]

    A Racz. Trigger throttling system for cms daq. Technical report, CERN, 2000. URL https://cds.cern.ch/record/479701

  71. [71]

    Henrik Bertelsen, G Carrillo Montoya, P-O Deviveiros, T Eifert, G Galster, J Glatzer, S Haas, A Marzin, MV Silva Oliveira, T Pauly, et al. Operation of the upgraded atlas central trigger processor during the lhc run 2. Journal of Instrumentation, 11(02):C02020–C02020, 2016

  72. [72]

    Alex Graves. Long short-term memory. Supervised sequence labelling with recurrent neural networks, pages 37–45, 2012

  73. [73]

    Rahul Dey and Fathi M Salem. Gate-variants of gated recurrent unit (gru) neural networks. In 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS), pages 1597–1600. IEEE, 2017

  74. [74]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  75. [75]

    Seyed Mehran Kazemi, Rishab Goel, Sepehr Eghbali, Janahan Ramanan, Jaspreet Sahota, Sanjay Thakur, Stella Wu, Cathal Smyth, Pascal Poupart, and Marcus Brubaker. Time2vec: Learning a vector representation of time. arXiv preprint arXiv:1907.05321, 2019

  76. [76]

    Bryan Lim, Sercan Ö Arık, Nicolas Loeff, and Tomas Pfister. Temporal fusion transformers for interpretable multi-horizon time series forecasting. International journal of forecasting, 37(4):1748–1764, 2021

  77. [77]

    Arsenii Gavrikov, Julián García Pardiñas, and Alberto Garfagnini. Dinamo: Dynamic and interpretable anomaly monitoring for large-scale particle physics experiments. Machine Learning: Science and Technology, 6(3):035050, 2025

  78. [78]

    Abhijith Gandrakota. Real-time anomaly detection at the l1 trigger of cms experiment. arXiv preprint arXiv:2411.19506, 2024

  79. [79]

    CMS Collaboration. 2024 Data Collected with AXOL1TL Anomaly Detection at the CMS Level-1 Trigger. Technical report, CERN, 2024. URL https://cds.cern.ch/record/2904695

  80. [80]

    Noah Zipper and CMS collaboration. Testing a neural network for anomaly detection in the cms global trigger test crate during run 3. Journal of Instrumentation, 19(03):C03029, 2024
